Rationalists, wearing square hats,
Think, in square rooms, 
Looking at the floor,
Looking at the ceiling.
They confine themselves
To right-angled triangles.
If they tried rhomboids, 
Cones, waving lines, ellipses--
As, for example, the ellipse of the half-
Rationalists would wear sombreros.

-- Wallace Stevens, "Six Significant Landscapes"
Quote | Posted on by | Leave a comment

Peter Kane’s net positive energy, CO2-free custom built home

Posted in American Solar Energy Society, climate mitigation, climate policy, decentralized electric power generation, decentralized energy, ecomodernism, energy storage, energy utilities, global warming, investment in wind and solar energy, keep fossil fuels in ground, Mark Jacobson, Massachusetts Clean Energy Center, mitigating climate disruption, National Renewable Energy Laboratory, Peter Kane, solar democracy, solar domination, solar energy, solar power, solar revolution, Talk Solar, technology, the green century, zero carbon | Leave a comment

Rebekah Jones

From Rebekah Joneskeynote at the Data Science for COVID-19:

Florida COVID Action

The COVID Monitor

Google COVID-19 Open Data Project

Posted in epidemiology, ethical ideals, ethics, Rebekah Jones, whistleblowing | Tagged , | Leave a comment

‘How a Plan to Save the Power System Disappeared’

Peter Fairley reports in The Atlantic:

A federal lab found a way to modernize the grid, reduce reliance on coal, and save consumers billions. Then Trump appointees blocked it.

“This article is a collaboration between The Atlantic and InvestigateWest.”

But a transformation of the world’s energy system to zero Carbon sources will come, whether the processes blessed by the Constitution of the United States want it to or not:

What the world needs:

(From M.Z.Jacobson, M.A.Delucchi, M.A.Cameron, et al)


The Earth is approaching 1.5°C global warming, air pollution kills over 7 million people yearly, and limited fossil fuel resources portend social instability. Rapid solutions are needed. We provide Green New Deal roadmaps for all three problems for 143 countries, representing 99.7% of world’s CO2 emissions. The roadmaps call for countries to move all energy to 100% clean, renewable wind-water-solar (WWS) energy, efficiency, and storage no later than 2050 with at least 80% by 2030. We find that countries and regions avoid blackouts despite WWS variability. Worldwide, WWS reduces energy needs by 57.1%, energy costs from $17.7 to $6.8 trillion/year (61%), and social (private plus health plus climate) costs from $76.1 to $6.8 trillion/year (91%) at a capital cost of ∼$73 trillion. WWS creates 28.6 million more long-term, full-time jobs than are lost and needs only 0.17% and 0.48% of land for footprint and space, respectively. Thus, WWS needs less energy, costs less, and creates more jobs than current energy.

Regarding the implied and possible inability of the Constitution of the United States being able to deal with this problem, here is an excerpt from my welcome at the Moakley Federal Courthouse in Boston, celebrating the judicial action and trial Juliana v United States:

For should the plaintiffs of Juliana fail, the last government branch, the judiciary, abdicates responsibility for solving this urgent problem. And so the Constitution will have failed one of its existential requirements: To provide for the common defense. For Nature has laws, too, and we have been breaking them for a long time, ever more intensely. But Nature does not have courts of grievance or redress. Nature just acts. In a catastrophic sea level rise, perhaps triggered by a collapse of a distant ice sheet, Moakley Courthouse itself, the land you stand on would be lost, and all that there [gesturing towards Boston inner harbor].

While disappointing, were Juliana to be overturned, this should not be a reason for despair. It would not mean the Constitution should be replaced. It would just mean it is useless for solving certain kinds of critically important problems. Its failure would imply the Constitution is becoming a dusty, old thing, irrelevant, like the Articles of Confederation are to us, a ceremonial relic. Let’s hope not.

There will be solutions for solving climate in any case, Constitution or not. They may well be horrifically expensive. And, while there’s no solution without first zeroing emissions, solutions will exist. These will lie beyond the Constitution, I hope Chief Justice Roberts and his colleagues understand the import of that.

Posted in Green New Deal, InvestigateWest, investment in wind and solar energy, Joseph Schumpeter, leaving fossil fuels in the ground, Mark Jacobson, Peter Fairley, The Atlantic | Tagged , , , , , , | Leave a comment

‘The virus is their new hoax’

And note that the variant of SARS-CoV-2 which has taken over the world is a more virulent, more damaging, and more infectious variant of the virus which infected Wuhan.

From the viruses’ perspective ….

Hoax. Yeah, right.

Hoax? Yeah, righhhhhhhht.

Posted in American Association for the Advancement of Science, American Statistical Association, an ignorant American public, climate disruption, coronavirus, COVID-19, SARS-CoV-2 | Leave a comment

Good news, and a beacon of progress, with hope for more to come

That’s Sundar Pichai, CEO, Google and Alphabet.

Ørsted : “Love your home”

Posted in afforestation, agrivoltaics, Alphabet, argoecology, Ørsted, being carbon dioxide, Bloomberg New Energy Finance, bridge to somewhere, Buckminster Fuller, carbon dioxide sequestration, climate change, climate disruption, climate education, climate mitigation, climate policy, ecocapitalism, ecomodernism, ecopragmatism, electricity, emissions, engineering, fossil fuel divestment, Global Carbon Project, global warming, global weirding, Green New Deal, greenhouse gases, keep fossil fuels in ground, leaving fossil fuels in the ground, Mark Jacobson, moral leadership, solar domination, solar energy, solar power, solar revolution, Sundar Pichai, sustainability, technology, the green century, wind energy, wind power, zero carbon | Tagged , , , , , , , , | Leave a comment

David Corliss on “Getting started in Data for Good”

Data for Good.

Statistics without Borders.


Posted in Data for Good, DataKind, Statistics without Borders | Leave a comment

Green New Deal Energy Plans for 143 Countries

Impacts of Green New Deal Energy Plans on Grid Stability, Costs, Jobs, Health, and Climate in 143 Countries

  • Mark Z. Jacobson
  • Mark A. Delucchi
  • Mary A. Cameron
  • Indu Priya Manogaran
  • Yanbo Shu
  • Anna-Katharina von Krauland


Global warming, air pollution, and energy insecurity are three of the greatest problems facing humanity. To address these problems, we develop Green New Deal energy roadmaps for 143 countries. The roadmaps call for a 100% transition of all-purpose business-as-usual (BAU) energy to wind-water-solar (WWS) energy, efficiency, and storage by 2050 with at least 80% by 2030. Our studies on grid stability find that the countries, grouped into 24 regions, can match demand exactly from 2050 to 2052 with 100% WWS supply and storage. We also derive new cost metrics. Worldwide, WWS energy reduces end-use energy by 57.1%, aggregate private energy costs from $17.7 to $6.8 trillion/year (61%), and aggregate social (private plus health plus climate) costs from $76.1 to $6.8 trillion/year (91%) at a present value capital cost of ∼$73 trillion. WWS energy creates 28.6 million more long-term, full-time jobs than BAU energy and needs only ∼0.17% and ∼0.48% of land for new footprint and spacing, respectively. Thus, WWS requires less energy, costs less, and creates more jobs than does BAU.

Posted in Mark Jacobson | Leave a comment

Has maintaining economic growth been worth it?

From Our World in Data article “No sign of a health-economy trade-off, quite the opposite“.

Have the countries experiencing the largest economic decline performed better in protecting the nation’s health, as we would expect if there was a trade-off?

The chart here shows the same GDP data along the horizontal axis. Along the vertical axis is the cumulative number of confirmed COVID-19 deaths per million people.

Contrary to the idea of a trade-off, we see that countries which suffered the most severe economic downturns – like Peru, Spain and the UK – are generally among the countries with the highest COVID-19 death rate.

And the reverse is also true: countries where the economic impact has been modest – like Taiwan, South Korea, and Lithuania – have also managed to keep the death rate low.

Notice too that countries with similar falls in GDP have witnessed very different death rates. For instance, compare the US and Sweden with Denmark and Poland. All four countries saw economic contractions of around 8 to 9 percent, but the death rates are markedly different: the US and Sweden have recorded 5 to 10 times more deaths per million.

Clearly, many factors have affected the COVID-19 death rate and the shock to the economy beyond the policy decisions made by each government about how to control the spread of the virus. And the full impacts of the pandemic are yet to be seen.

Posted in coronavirus, COVID-19, economics, epidemiology, pandemic, policy metrics, politics, SARS-CoV-2 | Tagged , , , , | Leave a comment


Updated, 2020-08-31, 13:30 EDT

A bit of an apology … I was working to produce a couple of more blog posts and had hoped to have one out, reporting on the uncertainty clouds for the phase plane plots of COVID-19 deaths I had done earlier, but the Moirai intervened.

My 10 year old HP p6740f Pavilion Windows 7 Home Premium system has gone dark. I won’t recount the travails of the system in its death throes. I worked essentially all the business hours of two weeks to try to bring it back. It was a software death, but it did not directly have anything to do with a virus (hint: it began because the software update of an anti-virus and firewall product went badly awry), and it looked promising for a time. But, then, a side effect of a repair caused some driver complications, and then fixing that caused the system to become unbootable.

There is a replacement on order, but it won’t be here until 23rd September 2020. It’s a custom built Dell Precision 3630 Tower with Intel Core i7-9700K, 32 GB RAM, 1 TB SSD, NVIDIA Quadro RTX4000, 8GB.

I will be running Windows 10 64-bit, and mostly R.

So, apologies for the delay.

Meanwhile, I’m devoting the time to long overdue tooling of textbooks I’ve been wanted to get to do. I may post a write-up about an idea I got during the Data Science for COVID-19 Conference I attended yesterday. I need to develop it a bit first, though.

Update, 2020-08-31, 13:30 EDT

Since the above is a discussion of Which Machine/Which OS, I thought it appropriate to include a response I gave to an R mailing list when someone asked:

I need a new computer. have a friend who is convinced that I have an
aura about me that just kills electronic devices.

Does anyone out there have an opinion about Windows vs. Linux?

Here’s what I said. It records my experience.

This ends up being a pretty personal decision, but here's my advice.

I have used Windows of various flavors, and Linux in a couple of versions. I have also used four or five Unixen, in addition to Linux. I've never spent a lot of time using a Mac, although in many instances most of my colleagues at companies have. It's invariably a cubicle-like environment, so when they have problems, you know. I also have a Chromebook, which is what I am using to write this, and while awaiting the arrival of a new Windows 10 system.

I have used R heavily on both Windows and Linux. On Linux I used it on my desktop, and I still use it on various large servers, now via RStudio, before from the shell. In the case of the servers, I don't have to maintain them, although I sometimes need to put up with peculiarities of their being maintained by others. (I rarely have sudo access, and sometimes someone has to install something for me, or help me install an R package, because the configuration of libraries on the server isn't quite what R expects.)

My experience with Linux desktops is that they seem fine initially, but then, inevitably, one day you need to upgrade to the next version of Ubuntu or whatever, and, for me, then the hell begins. In the last two times I did it, even with help of co-workers, it was so problematic, that I turned the desktop in, and stopped using the Linux.

Prior to my last Linux version, I also seemed to need to spend an increasingly large amount of time doing maintanence and moving things around ... I ran out of R library space once and had to move the entire installation elsewhere. I did, but it took literally 2 days to figured it out.

Yes, if Linux runs out of physical store -- a moment which isn't always predictable -- R freezes. Memory is of course an issue with Windows, but it simply does what, in my opinion, any modern system does and pages out to virtual memory, up to some limit of course. (I always begin my Windows R workspaces with 16 GB of RAM, and have expanded to 40 GB at times.) I have just purchased a new Windows 10 system, was going to get 64 GB of RAM, but, for economy, settled on 32 GB. (I'm semi-retired as well.) My practice on the old Windows 7 system (with 16 GB RAM) was that I purchased a 256 GB SSD and put the paging file there. That's not quite as good as RAM, but it's much better than a mechanical magnetic drive. My new Windows 10 has a 1 TB SSD. I may move my old 256 GB SSD over to the new just as a side store, but will need to observe system cooling limits. The new system is an 8 core Intel I7.

Windows updates are a pain, mostly because they almost always involve a reboot. I *loved* using my Windows 7 past end of support because there were no updates. I always found Windows Office programs to be incredibly annoying, tolerating them because if you exchange documents with the rest of the world, some appreciable fraction will be Word and Excel spreadsheets. That said, I got rid of all my official Microsoft Office and moved to Open Office, which is fine. I also primarily use LaTeX and MikTeX for my own documents authored, and often use R to generate tables and other things for including in the LaTeX.

On the other hand, when using Linux, ultimately YOU are responsible for keeping your libraries and everything else updated. When R updates, and new packages need to be updated, too, the update mechanism for Linux is recompiling from source. You sometimes need to do that for Windows, and Rtools gives you the way, but generally packages are in binary form. This means they are independent of the particular configuration of libraries you have on your system. That's great in my opinion. And easy. Occasionally you'll find an R package which is source only and for some reason doesn't work with Rtools. Then you are sometimes out of luck or need to run the source version of the package, if it's supported, which can be slow. Sometimes, but rarely, source versions aren't supported. I have also found in server environments that administrators are sometimes sloppy about keeping their gcc and other things updated. So at times I couldn't compile R packages because the admin on the server had an out-of-date gcc which produced a buggy version.

Whether Linux or Windows, I often use multi-core for the Monte Carlo calculations I run, whether bootstraps, random forests, or MCMC. I have used JAGS quite a lot but I don't believe it supports multi-core (unless something has changed recently). I use MCMCpack and others.

The media support for Windows is much better than Linux. (At least Ubuntu now *has* some.) And it is work to keep Linux meda properly updated. Still, I don't use Windows Media Player, preferring VLC.

And there are a wealth of programs and software available for Windows.

No doubt, you need a good anti-virus and a good firewall. (Heck, I have that on my Google Pixel 2, too.) I'm moving to the McAfee subscription my wife has for other systems in the house.

Note, while R is my primary computational world, by far, I do run Anaconda Python 3 from time to time. It can be useful for preparing data for consumption by R, given raw files, many with glitches and mistakes. But with the data.table package and other packages in R, I'm finding that's less and less true. The biggest headache of Python is that you need to keep its libraries updated. I also have used Python some times just to access MATPLOTLIB. I prefer R, though, because, like MATLAB, its numerics are better than Python's NUMPY and SCIPY.

As I said, I don't know Mac at all well. But I do know that, when Mac released a new version, somehow the colleagues about me would often degenerate into a couple of days of grumbling and meeting with each other about how they got past or around some stumbling point when updating their systems. Otherwise people seem to like them a lot.

I think all operating systems are deals with the Devil. It's what you put up with and deal with.

As you can see, I opted to go the Windows route again, for probably the next 10 years.


Posted in Dell 3630 Tower, Windows 10, Windows 7 | 2 Comments

Origins of “modern” hypothesis testing

In their interesting article for CHANCE from July 2020, Debra Boka and Harold Wainer cite, in a footnote, that:

In 1710, Dr John Arbuthnot used the number and sex of christenings listed at the bottom of the Bills to prove the existence of God and, in the process,
invented modern hypothesis testing.

This is based upon:

Arbuthnot, J. 1710. An argument for Divine Providence taken from the Constant Regularity in the Births of Both Sexes. Philosophical Transactions of the Royal Society 27, 186–190. London: The Royal Society.

It’s good to know that hypothesis testing and its extreme reduction, significance testing, had an origin in 1710 by invoking the Divine, and not a 1925 invention by Someone Else.

It also appears that Principle explaining the “equality of numbers of the sexes” named for the Someone, was anticipated by Arbuthnot a bit earlier.

Of course we credit Someone for inventing the term “Bayesian”, but he did not do it in tribute.

Posted in Bayesian, hypothesis testing, John Arbuthnot, Ronald W Fisher | Leave a comment

Professor Mark Z Jacobson on 100% zero Carbon energy, at North County Climate Change Alliance

With respect to some of the comments below the video:


Consumerwatchdog.org thinks that 100 million $ from ExxonMobil to fund Stanford and Mark Jacobsens research weakens public trust in his research results.


https://bit.ly/2YdPkmy This report at same site shows that the assertion was incorrect or based upon misunderstandings.


Suing an academic critic isn’t only wrong, it’s also unjust – critical article written by a group of highly regarded experts, Mark jacobson singled out and sued the only author who lacks the legal backing of an institutional employer. Atmospheric scientist and Stanford professor Mark Jacobson escalated an academic dispute out of the peer-reviewed literature and into the courtroom.


Well, because the criticism in PNAS was (a) uncharacteristic of criticism of any other paper there, (b) was rammed through at the last minute without warning, (c) amounted to effectively career and character assassination, and (d) eventually got set aside by independent groups of scholars. In fact, while they were not part of the critique paper in PNAS, some critics of Jacobson, Haegel, et al (2019, see above), came over after evidence over the next couple of years proved the Jacobson team correct.

Posted in carbon dioxide, clean disruption, CleanTechnica, climate change, climate disruption, climate economics, fossil fuels, global warming, investment in wind and solar energy, Mark Jacobson, solar democracy, solar energy, solar power, solar revolution, Tony Seba, wind energy, wind power, zero carbon | Tagged , , , , , , | Leave a comment

“Noam Chomsky wants you to vote for Joe Biden and then haunt his dreams”

At Ink.

You compare those forces, and it looks like, How could this even be a struggle? But that’s the wrong measure. There are people, and they make a difference.

We can go back to my favorite philosopher, David Hume. His Of the First Principles of Government, a political tract in the late 18th century, starts off by saying that we should understand that power is in the hands of the governed. Those who are governed, they’re the ones who have the power. Whatever kind of state it is, militaristic or more democratic, as England was becoming. The masters rule only by consent. And if consent is withdrawn, they lose. Their rule is very fragile.

I should say that today’s masters of the universe, as they modestly call themselves, understand this very well. Every January in Davos, the Switzerland ski resort, the great and powerful gather to go skiing, enjoy themselves, and congratulate each other on how wonderful they are. Top CEOs and media figures and entertainment figures and so on.

But this year was different. The theme was different. The theme was: We’re in trouble. The peasants are coming with their pitchforks. As they would prefer to put it, we’re facing “reputational risks. They’re coming after us. Our control is fragile. We have to provide a different message. So the message at Davos was: Yes, we realize we’ve done wrong things during this whole neoliberal period. You, the general population, have suffered. We understand that. We’re overcoming our mistakes. We’re now going to be committed to you, the stakeholders and working-class communities, we’re really committed to your welfare. We’re becoming deeply humanitarian. We regret our mistakes. You can put your faith in us. We’ll take over and work for your benefit.

Posted in politics | Leave a comment

Isaias and Oak Island, NC

h/t to Professor Rob Young, Program for the Study of Developed Shorelines at Western Carolina University, via LinkedIn.

Posted in coastal communities, coastal investment risks, coasts, hurricanes, Robert Young, shorelines | Leave a comment

Got ya!

Non-Tesla drivers seem not to know (or care?) that because of the capability to do self-driving, all Teslas have a constellation of cameras all about the vehicle. These not only operate while driving, they also monitor activity about Teslas when they are parked, and record extra footage whenever anything approaches the car. They also have some, ahem, creative anti-theft devices built in.

Well, I was returning from a grocery pick up this afternoon in our Tesla Model 3, using the curbside pickup service of Roche Brothers in West Roxbury, MA. Along Route 109 where it crosses Interstate 95, I moved into the right lane, since in Westwood, the left lane drops off. I was not really intending to pass the vehicle I did on the right. It’s just the speed I was going and the speed it was going. However, a little drama ensued as you can see below. This was grabbed off the recording of the Tesla dashcam.

I annotated the license plate, but it zips by pretty fast, so I grabbed it.

Posted in Tesla | Leave a comment

“Inferring change points in the spread of COVID-19 reveals the effectiveness of interventions”

J. Dehning et al., Science 369, eabb9789 (2020). DOI: 10.1126/science.abb9789

Source code and data.

Note: This is not a classical approach to assessing strength of interventions using either counterfactuals or other kinds of causal inference. Accordingly, the argument for the effectiveness of interventions is weaker than it might be.

Posted in American Association for the Advancement of Science, American Statistical Association, Bayesian, Bayesian computational methods, causal inference, causation, changepoint detection, coronavirus, counterfactuals, COVID-19, epidemiology, SARS-CoV-2, state-space models, statistical series, time series | Leave a comment

Larkin Poe Interview


Posted in Larkin Poe | Leave a comment

Indiara Safir: “Amazing Grace”, “Carlos & Maria”, “Time”

Posted in Blues, Indiara Safir | Leave a comment

Political and demographic associations with changes in COVID-19 rates at the tails

A few days ago, I wrote a post about poltical and demographic associations with changes in COVID-19 rates over all U.S. counties. Today, I’m augmenting that. For here, rather than considering all counties, I limited the study to counties with demonstrably large increases or decreases in COVID-19 prevalence.

Rather than working with 3083 counties, this works with 742 of them. Essentially the cutoff was that the response calculated had to exceed 7.02 in magnitude:

Two regressions were repeated. The first was a standard linear regression of response versus all the remaining predictors. The improvement in R^{2} compared with the linear regression done with the complete dataset was remarkable: Adjusted R^{2} was 0.47 rather than 0.16. The second was the random forest regression. Here, too, the improvement in R^{2} was substantial: 0.71 versus 0.54 previously.

Here’s the result of the linear regression:

And, while the number of variables selected in the importance screening did not change, the ones having a consistent (monotonic) effect did. Those contributing to an increase in COVID-19 became:

  • otherpres16
  • otherhouse16
  • hispanic_pct
  • PerCapitaDollars
  • PctChgFrom2017

Those contributing to a decrease in COVID-19 became:

  • repsen16
  • rephouse16
  • black_pct
  • clf_unemploy_pct
  • lesshs_pct
  • lesshs_whites_pct

Note, for contrast, the random forests regression based upon all the counties had those variables contributing to an increase being:

  • otherpres16
  • otherhouse16
  • hispanic_pct
  • PerCapitaDollars
  • PctChgFrom2017

The random forests regression based upon all the counties had these variables:

  • demhouse16
  • black_pct
  • clf_unemploy_pct
  • lesshs_pct
  • lesshs_whites_pct
  • trump.obama.ratio

contributing to a decrease in COVID-19 prevalence.

The current choices make more sense, in addition to being more statistically notable because of the increased R^{2}. But increase in Republican Senate and House support associated with decrease in COVID-19? Hmmmm.

Posted in Five Thirty Eight, Tamino | Leave a comment

So, today, a diversion …

(Updated 2020-07-06, see end.)

So, today I report on a diversion. This is a bit off the beaten path for me. It resulted from an “I wonder” comment by Rob Honeycutt at Tamino‘s “Open Mind” blog, this time offering a post concerning how predictive the party affiliation of a governor of a state might be for the recent performance of the state with respect to COVID-19 (titled “COVID-19: Red States Blue States Redux”).

Rob supposed:

I would think running the numbers by county would be more accurate, though I guess that’s a much bigger task. CA, for example, is obviously predominantly blue, but there are a lot of small red counties which are now seeing a dramatic rise in cases.

And I replied:

Sounds like a logistic regression problem, although I’d/I’ll do it as a classification problem for a random forest. Anyone know where I can get a tabulation of counties vs Red or Blue status? Otherwise I’ll dig around on the Five Thirty Eight GitHub and see if they have one.

Well I didn’t find anything pertinent at Five Thirty Eight, and I didn’t do it as a classification problem, deciding that after the fact.

I found data sources.

One is the tried-and-true Johns Hopkins Github of COVID-19 cases and deaths which I have used elsewhere here. I’ve also discussed sources of COVID-19 data in general.

A second is the U.S. Bureau of Economic Analysis and, specifically, their county level personal income and demographics dataset.

And, rounding out the collection, there is MIT’s ElectionLab which offers datasets with demographics showing county level results through the 2018 General Elections. Their data dictionary is here. This provided the missing classification of Red vs Blue counties.

However, this also lets me tout a basic principle of good data science and statistics. This is something I learned from the books and lectures of Professor Frank Harrell, Jr of Vanderbilt University in Nashville, TN. In particular, Professor Harrell (and others) emphasize(s) the damage premature dichotomization can do to a statistical analysis.

So, instead of categorizing counties as Red vs Blue, or performance on COVID-19 as good vs bad, I am trying to push everything onto ordered measures and, preferably, onto a continuum measure. Indeed, the more I practice data science, the more I am convinced that, in the setup of experiments or data gathering or analysis, if categorical variables or yes/no variables can be at all avoided, they should be. Rather than measuring whether an Internet customer bought something or not, measure the time they spent considering it! Interpreting the duration of dwell on an item is qualitatively different if they buy it or if they don’t. In the former case, they were willing to devote that amount of time investigating. In the latter, it might be they are meticulous and decided against, or it could be the response was too slow and discouraged them. But even in the meticulous case, it shows they were almost to the point of a buy.

So, for assessing how well a county is doing for COVID-19 or not, I took the counts of cases from the county for the last 45 days. I fit a (penalized) smoothing spline to that. I used the smoothing spline to calculate a first derivative for all the days I could. I took the means of those derivatives. And I made that the score of how poorly or well the county was doing. If the mean was positive, the number of cases is going up, and it is doing poorly. If the mean is negative, cases are decreasing.

This all said, why am I clearly drifting my analysis into the world of, oh, politics?

While Five Thirty Eight‘s Nate Silver might claim politics hasn’t as much data available as does sports, a lot of political data exist and are well vetted in comparison to environmental or, as I cringe, Internet data. So it is, in a sense, a joy to work with political data. Most real Internet data, with which in my professional career I have been most deeply associated, needs to be collected by means outside the Internet, and there are interesting investigations and datasets available if that’s done.

Still, a lot of the counts and data pertaining to COVID-19 cases and deaths are fuzzy. We don’t necessarily have a reputable tally of sources in all states of, say, the United States, let alone counties, and let alone countries outside the United States. Assessing reputation depends upon knowing the commitment of authorities to presenting counts aligned with time, authorities which want to be transparent (some don’t: I know, that’s shocking), and knowing that rules for reporting are consistent or efforts are made to allow them to be made consistent even if they are changed. Political authorities tend, from my perspective and for the most part, want to do as little as possible and expend the least budget as possible to observe minimal reporting goals, if that.

The core idea was to develop a rich matrix of predictors or covariates consisting of demographic and other data pertaining to United States counties and match that with a response variable which expresses the overall (mean) slope of increases in COVID-19 cases. While I know counts of cases of COVID-19 aren’t the best measure, with case density per test or population estimates from random sampling being better, it is a widely available measure, and is the measure Tamino chose to use in his original post, namely cases per day per million population. Then, once matched, do a regression of response against predictors and try to develop a model which can be used for variable selection, that is, determining those covariates which are most predictive of increase in COVID-19 cases.

I chose to use a random forests modeling approach, on purely numerical outcome and predictors. The outcome was, as described above, taking the last 45 days of confirmed COVID-19 counts per county, and fitting them with a smoothing spline. Then the first (time) derivative of that spline was calculated per day of the last 45 days. Then the mean value of the first (time) derivative or slope was calculated. That was used as a response. However, the response was not calculated based upon raw counts of cases. Rather, the re-scaled or, as sometimes described, standardized value of case counts over time was used, lest the greater number of people in counties bias their experiences as being more representative. After all, the point is to get a per county picture.

Rescaling was done to covariates, too. In this case this had a beneficial effect: Some of the 3083(*) counties had missing values for some of covariates. Tossing them would have reduced the sample size and imposed a bias because quite a few counties were missing the value for at least one covariate. By rescaling, the mean of the rescaled values became zero, so missing values could be replaced by zeros, and this was equivalent to imputing copies of the mean for the missing values.

What were the covariates? They are described in the data dictionary, but I added a couple more. Here’s the complete list:

Presidential candidate Trump vote total in 2016.
Presidential candidate Clinton vote total in 2016.
Presidential candidate vote total in 2016 for non-major party.
Presidential candidate Romney vote total in 2012.
Presidential candidate Obama vote total in 2012.
Democratic Senate candidate vote total in 2016.
Republican Senate candidate vote total in 2016.
Non-major parties Senate vote total in 2016.
Democratic House candidate vote total in 2016.
Republican House candidate vote total in 2016.
Non-major parties candidate vote total in 2016.
Total population as of 2016.
Non-Hispanic whites as a percentage of total population, 2016.
Non-Hispanic blacks as a percentage of total population, 2016.
Hispanics or Latinos as a percentage of total population, 2016.
Non-whites as a percentage of total population, 2016.
Foreign-born population as a percentage of total population.
Females as a percentage of total population, 2016.
Population 29 years or under as a percentage of total population, 2016.
Population 65 years or older as a percentage of total population, 2016.
Median household income in the past 12 months (in 2016 inflation-adjusted dollars), 2016.
Unemployed population in labor force as a percentage of total population in civilian labor force, 2016.
Population with an education of less than a regular high school diploma as a percentage of total population, 2016.
Population with an education of less than a bachelor’s degree as a percentage of total population, 2016.
White population with an education of less than a regular high school diploma as a percentage of total population, 2016.
White population with an education of less than a bachelor’s degree as a percentage of total population, 2016.
Rural population as a percentage of total population, 2016.
Rural-urban continuum codes. See data dictionary for table of specific code values. These run as positive integers from 1 to 9, with 1 representing, for example, “Counties in metro areas of 1 million population or more” and 9 representing “Completely rural or less than 2,500 urban population, adjacent to a metro area”. These are not continuous values, but they are ordered.
This was a new covariated introduced. Trying to normalize away county population size, this is the ratio of votes in the county in 2016 for presidential candidate Trump to the sum of Clinton and non-major party candidates. Clinton votes are summed with non-majors because there are no votes in some counties for Clinton.
This was a new covariated introduced. Trying to normalize away county population size, this is the ratio of votes in the county for presidential candidate Trump in 2016 to the sum of Obama and non-major party candidates in 2012. Obama votes are summed with non-majors because there are no votes in some counties for Obama.
This was a new covariated introduced. Trying to normalize away county population size, this is the ratio of votes for Republican Senate candidates in the county in 2016 to the sum of votes for Democratic and non-major party candidates. Democratic candidates are summed with non-major parties because there are no votes for Democrats in some counties.
This was a new covariated introduced. Trying to normalize away county population size, this is the ratio of votes for Republican House candidates in the county in 2016 to the sum of votes for Democratic and non-major party candidates. Democratic candidates are summed with non-major parties because there are no votes for Democrats in some counties.
This is from BEA government data. This is the sum of income of residents of the county divided by the number of residents in the county. See the documentation for methodology.
2018 data.
This is from BEA government data. This is the rank of total personal income of residents of the county over all counties in the state to which they are part. Number 1 is the county with the biggest income. 2018 data.
This is the percent change in personal income of residents in a county for 2018 relative to 2017.
(*) Different sources have differing number of counties, and these need to be reconciled. Also, some sources, such as the BEA data, have sub-tabulations to regions and states. There are also errors. In the MIT data, Virginia for example is listed twice, with different values for, e.g., median household income. Finally, spellings of county names need to be reconciled. A few counties in one list are missing in others. 3083 is the number of common counties between the datasets.

Wright and Ziegler‘s ranger package of R was used to explore these data and perform a regression. Varying numbers of trees were tried, and the final result was based upon a forest of 40,000 trees. Two cores were allocated on a 3.2 GHz 4-core Windows 7 system. Up to 6 variables were considered for splitting at each node. The minimum node size was 5, meaning there had to be at least 5 cases passing through each node. The splitting rule was “extra trees”, based upon the option supporting

Geurts, P., Ernst, D., Wehenkel, L. (2006). Extremely randomized trees. Machine Learning 63:3-42. https://doi.org/10.1007/s10994-006-6226-1.

10 random splits were considered for each splitting variable.

There were actually two phases of random forest fitting performed. The first was done to ascertain covariate importance. The second was a regression. Both used 40,000 trees.

In the first phase, permutation importance was used, with importance scaled and local importance values calculated as in:

Breiman, L. (2001). Random forests. Machine Learning, 45:5-32. https://doi.org/10.1023/A:1010933404324.

Permutation importance was used because this was a regression application. The resulting normalized importance scores sorted in decreasing order of importance were:

normalized variable importance
demhouse16 black_pct othersen16 trump.obama.ratio demsen16 repsen16
0.077281 0.069349 0.056643 0.053579 0.051434 0.049935
senate.ratio otherhouse16 lesshs_whites_pct lesshs_pct hispanic_pct rephouse16
0.044154 0.041854 0.037017 0.035589 0.033906 0.032685
nonwhite_pct white_pct obama12 rural_pct clf_unemploy_pct PerCapitaDollars
0.030671 0.030221 0.028720 0.025665 0.025534 0.025463
ruralurban_cc RankInState clinton16 otherpres16 median_hh_inc trump16
0.024347 0.024095 0.023788 0.020571 0.020259 0.018410
foreignborn_pct total_population romney12 lesscollege_whites_pct lesscollege_pct age65andolder_pct
0.016596 0.016517 0.016443 0.015623 0.015530 0.012926
age29andunder_pct PctChgFrom2017 female_pct trump.clinton.ratio house.ratio
0.010310 0.008777 0.006107 0.000000 0.000000

Covariates below the 0.02 quantile were discarded from further consideration. This eliminated female_pct, trump.clinton.ratio, and house.ratio.

Regression was done with the remaining. The fit was barely adequate(**), with an R^{2} = 0.536, and a root-mean-squared error of 4.64. Here is the original mass density of response values:

Recall these were based upon slopes derived from centered and scaled case counts:

scaledDerivativesFrom<- function(setOfSeries)
# Expects cases to be in columns, and the series to be column vectors, that is,
# the successive time points are in successive rows.
nr<- nrow(setOfSeries)
nc<- ncol(setOfSeries)
tocks<- 1:nc
scaledSetOfSeries<- scale(setOfSeries)
centers<- attr(scaledSetOfSeries, "scaled:center")
scalings<- attr(scaledSetOfSeries, "scaled:scale")
fdFromSeries<- suppressWarnings(fdata(mdata=setOfSeries, argvals=tocks))
splining<- suppressWarnings(fdata.deriv(fdataobj=fdFromSeries, nderiv=0, method="bspline",
class.out="fdata", nbasis=basisSize(nr)))
fst<- suppressWarnings(fdata.deriv(fdataobj=fdFromSeries, nderiv=1, method="bspline",
class.out="fdata", nbasis=basisSize(nr)))
fstDerivatives<- colMeans(fst$data)
return(list(fst=fstDerivatives, ctr=centers, sca=scalings))

\text{fdata} and \text{fdata.deriv} are from the fda.usc package.

After this preparation, the point of regression was to determine which covariates contribute to increases in COVID-19 prevalence and which reduce it. In order to determine these, a baseline regression was performed with all rescaled covariates set to their mean values, that is, zeros. Then, because of the scaling, a variation in single standard deviation up or down was increasing or decreasing a covariate value by a unit. Accordingly, two predictions of response was done with all covariates zero apart from a single covariate of interest. All trees in the regression forest were used for these predictions. One prediction was done with the covariate at plus one-half and the other with the covariate at minus one-half. Responses were recorded in each case.

If these responses for each covariate were arranged as minus one-half, baseline, and plus one-half, there are two cases where the responses were monotonic, either monotonically increasing, or monotonically decreasing. If the responses increased monotonically, that meant the covariate was contributing to the increase in COVID-19 incidence. If the responses decreased monotonically, that meant the covariate was impeding growth in COVID-19. All other cases were interpreted as equivocal. A full listing of the responses is given below. There were a couple of covariates which had no significant effect upon response.

The code and datasets for this analysis are available here. I did not place these in my Github because some of the datasets and results (including the random forests) are pretty large.

(**) Slightly more than half of the variance was explained. This certainly is not good enough for prediction forecasting.

In the conclusion most of the covariates did not show a monotonic response. Of those which did, some had surprising effects upon response, contrary to what would be thought. In all cases, it shows that whatever is happening at state levels, as Tamino originally hypothesized, the story at county levels are far more nuanced.

According to the regression, the increases in the following covariates are associated with increases in COVID-19 incidence:

  • Number of votes for non-major party presidential candidates in 2016.
  • Number of votes for non-major party House candidates in 2016.
  • Hispanics or Latinos as a percentage of total population, 2016.
  • Personal per capita income, 2018.
  • Percent change in per capita income from 2017.

According to the regression, increases in the following covariates are associated with decreases in COVID-19 incidence:

  • Number of votes for Democratic party House candidates in 2016.
  • Non-Hispanic blacks as a percentage of total population, 2016.
  • Unemployed population in labor force as a percentage of total population in civilian labor force, 2016.
  • Population with an education of less than a regular high school diploma as a percentage of total population, 2016.
  • White population with an education of less than a regular high school diploma as a percentage of total population, 2016.
  • The ratio of votes in the county for presidential candidate Trump in 2016 to the sum of Obama and non-major party candidates in 2012.

The remainder have non-monotonic effects upon response, with the exception of votes for Trump in 2016, votes for Romney in 2012, and the ratio of Trump votes to sum of Clinton and non-major party Presidential candidates in 2016. The latter three had no significant effect upon response.

These are difficult to interpret, certainly from any common wisdom conventional frame. A story could be told about propensity for supporting non-majority political parties as contrarians. A story could be told about wealth and especially recent wealth. And a more conventional story might be told about support for Democratic House candidates. But the analysis is too weak to build much of a case here. For why is county historical support for Trump over Obama and others associated with decreases in COVID-19?

Another analysis, perhaps using a Bayesian additive regression trees or a generalized additive model might reveal more. I have included the result of a straight regression at the bottom of the post.

The evidence indicates that factors contributing to increases in COVID-19 across counties are more complicated than state political trends suggest.

The table of predicted responses for all regressed covariates is:

decrease covariate baseline increase covariate
trump16 0.585751997130 0.63150294777 0.5393147721024
clinton16 0.636470821891 0.63150294777 0.9463412693325
otherpres16 0.035647289876 0.63150294777 1.2041816952557
romney12 0.559379728900 0.63150294777 0.5688396494840
obama12 0.743567621794 0.63150294777 0.9029509887327
demsen16 0.195284466985 0.63150294777 0.3834763150956
repsen16 0.563399591536 0.63150294777 -0.5634050367543
othersen16 0.482441423494 0.63150294777 -0.2858281838729
demhouse16 1.742885286145 0.63150294777 0.5681492070191
rephouse16 0.390443844733 0.63150294777 0.5296431047217
otherhouse16 0.277166428927 0.63150294777 0.6911570905665
total_population 0.485374266285 0.63150294777 0.4872532869708
white_pct 0.555369233414 0.63150294777 0.3409851865292
black_pct 1.458426115793 0.63150294777 -1.2545806208186
hispanic_pct 0.203970849739 0.63150294777 0.7626913277465
nonwhite_pct 0.306006353575 0.63150294777 0.5322087600115
foreignborn_pct 0.539705099022 0.63150294777 0.4336693730927
female_pct 0.610347729825 0.63150294777 0.4323089313331
age29andunder_pct 0.472453265562 0.63150294777 0.5049536737941
age65andolder_pct 0.403374548917 0.63150294777 0.5885353905113
median_hh_inc 0.456473898331 0.63150294777 0.6011229846262
clf_unemploy_pct 0.761125923726 0.63150294777 -0.0129693833895
lesshs_pct 0.770957215281 0.63150294777 -0.0098785432455
lesscollege_pct 0.556661896877 0.63150294777 0.3748009374647
lesshs_whites_pct 0.749246848692 0.63150294777 -0.1558774570007
lesscollege_whites_pct 0.521495353443 0.63150294777 0.4123318092143
rural_pct 0.602826246670 0.63150294777 0.2588820550810
ruralurban_cc 0.297837745133 0.63150294777 0.3138831021115
trump.clinton.ratio 0.631502947772 0.63150294777 0.6315029477723
trump.obama.ratio 1.037253853907 0.63150294777 0.2651589870262
senate.ratio -0.103308756239 0.63150294777 -0.0583507114855
house.ratio 0.631502947772 0.63150294777 0.6315029477723
PerCapitaDollars -0.111339953111 0.63150294777 0.7183689921657
RankInState 0.439511214174 0.63150294777 0.5414556938257
PctChgFrom2017 -0.525939132513 0.63150294777 0.6810736843380

The following is a result of a linear regression with intercept. Note that the R^{2} is poorer than with the random forest.

Update, 2020-07-06

In the assessment above, I mentioned using BARTs and GAMs as tools for investigating why

  • Number of votes for non-major party presidential candidates in 2016.
  • Number of votes for non-major party House candidates in 2016.
  • Hispanics or Latinos as a percentage of total population, 2016.
  • Personal per capita income, 2018.
  • Percent change in per capita income from 2017.

are associated with increase in COVID-19. After thinking on it for a second day, when I devote time to this problem again, I may instead look at determining the causal effect, if any, of these factors upon increases in COVID-19 incidence, perhaps using the methods of propensity scoring or others. In particular, there are methods in the R package TWANG.

Posted in Five Thirty Eight, Tamino | 1 Comment

Calculating Derivatives from Random Forests

(Comment on prediction intervals for random forests, and links to a paper.)

(Edits to repair smudges, 2020-06-28, about 0945 EDT. Closing comment, 2020-06-30, 1450 EDT.)

There are lots of ways of learning about mathematical constructs, even about actual machines. One way is to push them into a place of improbable application, asking of them things which seem implausible, and watch how they fail. Indeed, they might not fail.

Many of the engineers from the NASA Apollo era spent their spare time with simulators of the CSM and LEM to develop a “feel” for what these machines could do and when they got into trouble. This allowed them to understand their capabilities beyond their formal specifications. This is quite possibly why Jack Garman was able to give the go-ahead for the Apollo 11 lunar landing, given a section of code was patched out, despite the overload of the LEM’s guidance computer from its rendezvous radar being set erroneously on in addition to the doppler landing radar.

Random forests were originally devised by Kleinberg and Ho (1990-2000), and then extended by Breiman, Cutler, Amit, and Geman. Breiman and Cutler are most frequently associated with these important statistical learning devices. They can be used for classification, regression, and variable selection. They can also be used for clustering.

Regression with random forests can be done on the same data that linear or generalized linear regression are done. Whereas the latter two means of calculation offer some continuity conditions which justify how one might use the outputs of their models to calculate, say, derivatives of responses with respect to predictors, it is less obvious how this might be done with random forests.

So, to set the stage, we begin with an application of random forests to time series. In fact, the series in question is reasonably famous, being the Air Passengers dataset introduced by Box & Jenkins and featured in the R package called forecast which, in its most recent rendition, offers the clever function \text{auto.arima} as a means of automatically building ARIMA models for such series.

The plan is to regress with \text{auto.arima} and regress with random forests, and see how the results compare. Then the next step is to figure out how to calculate derivatives from random forest predictions and then uncertainties in those, specifically prediction intervals. (See also on prediction intervals.)

After this post was written, I discovered the exciting paper …

H. Zhang, J. Zimmerman, D. Nettleton, D. J. Nordman, "Random Forest Prediction Intervals", (2019) The American Statistician, DOI: 10.1080/00031305.2019.1585288.

having this supplement.

It is also realized in the standalone R package rfinterval based upon the excellent ranger package.

In this paper, Zhang, Zimmerman, Nettleton, and Nordman overcome the evidence dearth problem encountered by quantile regression forests when trying to estimate prediction intervals, using a technique closely related to the conformal prediction methods in

J. Lei, M. G’Sell, A. Rinaldo, R. J. Tibshirani, L. Wasserman, "Distribution-free predictive inference for regression", (2018) Journal of the American Statistical Association, 113:523, 1094-1111, DOI: 10.1080/01621459.2017.1307116.

Some preliminaries ….

Since the R forecast package has been mentioned, the ranger package should be introduced as the one which will be used to train and test random forests, where, for those unfamiliar, training is a calibration process. Other packages have elements which are critical to this development, namely, the random, coda, and fda.usc packages.

And, incidentally, the code and some supporting files from this blog post are available here.

How can anything be done with a set of dependent data like a time series using random forests? After all, random forest ensemble methods do not in themselves have any inherent applicability to dependent data. Response-covariates sets are considered independent of one another. But, what can be done is mimic the ARIMA setup. Define a set of covariates for a response y_{t} in a series to be the union of the some subset of lagged values of the response, such as \{y_{t-1}, y_{t-2}, \dots, y_{t-p}\}, with some subset of lagged moving averages of the response. These are, in fact, for ARIMA, the basic features of the series which are used, along with some spectral theory, to produce the methods ARIMA employs.

In other words, the matrix of candidate covariates for responses looks like, for p > q:

\left[ \begin{array} {cccccccccc} y_{p} & y_{p-1} & y_{p-2} & \dots & y_{1} & a_{p,2} & a_{p,3} & \dots & a_{p,q)} & \tilde{\mu} \\ y_{1+p} & y_{p} & y_{p-1} & \dots & y_{2} & a_{(1+p),2} & a_{(1+p),3} & \dots & a_{(1+p),q)} & \tilde{\mu} \\ \vdots & \vdots & \vdots & \vdots & \vdots & \vdots & \vdots & \vdots & \vdots \\ y_{n-1} & y_{n-2} & y_{n-3} & \dots & y_{n-p} & a_{(n-1),2} & a_{(n-1),3} & \dots & a_{(n-1),q} & \tilde{\mu} \\ y_{n} & y_{n-1} & y_{n-2} & \dots & y_{1+n-p} & a_{n,2} & a_{n,3} & \dots & a_{n,q} & \tilde{\mu} \\ \end{array} \right]

a_{t,2} = \frac{y_{t-1} + y_{t-2}}{2} \\[0.5cm] a_{t,3} = \frac{y_{t-1} + y_{t-2} + y_{t-3}}{3} \\[0.5cm] \vdots \\[0.5cm] a_{t,q} = \frac{y_{t-1} + y_{t-2} + \dots + y_{1+t-q} + y_{t-q}}{q}

The y_{t-\ell} terms are the vector of p responses corresponding to response y_{t} but lagged by \ell time steps. The a_{t,j} are averages of response values from y_{t-1} through y_{t-j}. \tilde{\mu} is the median of y_{1}, y_{2}, \dots, y_{n}[2].

Why not use ARIMA? Any parametric method brings assumptions to a problem and the point of non-parametric methods, even modern ones like random forests, is to relax the assumptions that otherwise might be needed for estimation and prediction without costing too much. With random forests, there are ancillary benefits, like variable selection. Specifically, rather than choosing an order based upon autocorrelations and the like, as is the practice with random forests work, big a high number of lags and moving averages and let preliminary grows of a random forest indicate which are important for explaining the series.

It’s to be expected there will be drawbacks.

  • The amount of computation required will be higher, because of the need for sampling and resampling for constructing trees.
  • Because a large number of lags are needed to be considered, at least before variable selection is performed, the first parts of series cannot be analyzed as thoroughly as the remainders. More might be analyzed once variable selection indicates that, at least for the tail of the series, some older points of the series do not inform the present.
  • A random forest model may be able to forecast, but it’s capability depends upon many complicated relationships between the in-sample (or training) portion of a series and the out-of-sample (or test) portion. To the degree the ARIMA framework is accepted and, I daresay, works, the premises of the in-sample portion by assumption extend into the out-of-sample portion, and so forecasting can work and be powerful. Should they be violated, however, the ARIMA forecast may be useless because some kind of changepoint is encountered, even if in most systems for which ARIMA works, such violent changes are unlikely. If they are unlikely, there should be some relationship between the character of the in-sample with the out-of-sample. The degree to which these same considerations extend to use of random forests remains, to my knowledge, to be explored. Nonetheless, tuning random forests for out-of-sample performance has produced some notable successes[1].

Nevertheless, it’s easy to see how random forest approaches to analyzing series might be useful. They are insensitive to natures of distributions, they are flexible to use, and, in contrast with some other statistical learning methods, comparatively transparent. Forests trained on different datasets can be merged and consolidated.

[1] See:

  • G. Dudek, “Short-term load forecasting using random forests”, Intelligent Systems 2014., Springer, Cham, 2015, 821-828.
  • A. Lahouar, J. Ben Hadj Slama, “Hour-ahead wind power forecast based on random forests”, Renewable Energy, 2017, 109, 529-541.
  • W. Y. N. Naing, Z. Z. Htike. “Forecasting of monthly temperature variations using random forests”, APRN J. Eng. Appl. Sci, 2015, 10, 10109-10112.

[2] Because \tilde{\mu} is constant for any given series, it won’t be predictive in any fashion so could have been left out of the covariates matrix. However, were the forest trained on additional series, having differing medians might become predictive. Note that this implies both variable selection and regression with a reduced variable set need to be done on new data series. Note as well merging of forests may have to be done with the full original parent set of variables rather than with the reduced set in that case.

Right out of the gate, here’s how one random forest compares with the result from \text{auto.arima}:

(To see larger image, click on figure. It will open in a new window.)

The quality is entirely comparable to the ARIMA result. Note a check of the lags and moving averages picked by \text{auto.arima} shows little overlap with what the random forest elects. What that means for the interpretability of Air Passengers behavior for one or the other of the techniques is beyond the scope of this blog post, but it poses an interesting scientific question.

The forest was produced as a result of a seven step process. Note that prior to building forests or any other step, a replica of the original series is made which is centered and scaled. In other words, the mean, \mu_{y}, and standard deviation, \sigma_{y}, of the original series is calculated, and a transformed version of the series \hat{y}_{t} = \frac{y_{t} - \mu_{y}}{\sigma_{y}} is used in place of the original. This recentering is standard practice for good statistical and numerical properties. Results coming out of the model can be transformed back into the original measure by multiplying them by \sigma_{y} and adding \mu_{y} to the result.

First, a forest of 100,000 trees was grown to ascertain importance of predictors using permutation importance measures. Other details will be given below[3].

Second, a forest of 100,000 trees was grown to do the regression using the selected set of variables, and resulting in the figure above. The resulting R^{2} = 0.996 and the mean-squared-error was 0.0029.

Third, the forest produced in the second step is used to make predictions for all the time points observed. These are used to produce a baseline series for the estimate of Air Passengers over time, and the estimates for that quantity’s first and second time derivatives.

Fourth, the forest produced in the second step is used to make predictions using each of the trees in the forest for all the time points observed. These are used to develop prediction errors in estimates of quantity, and its first and second time derivatives.

Fifth, whenever derivatives need to be estimated, the B-splines technique in the fda.usc package is used.

Sixth, uncertainty estimates in quantity, and its first and second time derivatives are obtained by bootstrapping the full set of individual predictions in the third step above. Each bootstrap of the (here) 200 bootstrap samples results in 100,000 time series predictions, one for a tree in the forest. (Because bootstrapping entails sampling with replacement the same series might be elected multiple times.) The value of the k^{\text{th}} time point in the resulting series is the mean of all 100,000 k^{\text{th}} points in the bootstrap sample. fda.usc splines are used to obtain smooth quantities, and first and second time derivatives from each of the bootstrap samples. To obtain prediction intervals for each of the time points in the series, the built-in R \text{lm} and \text{predict.lm} functions are used to produce them, using the 200 estimates at each time point.

Seventh, results are compiled and plotted.

[3] The number of variables to possibly split at each node was \lceil \sqrt{m}\,\rceil, where m is the number of predictor variables. Initially that was 108 predictors, including 51 lagged responses, 1-51, 53 moving averages, for training windows of sizes 2 to 54, and the median value for the entire series. So, initially, \lceil \sqrt{m}\,\rceil = 11. After variable selection, with normalized variable importance scores having to be greater than 0.005, there was 31 variables used for regression. These included 7 lagged responses and 24 moving averages. Lags 1, 2, 3, and 33-38 were considered important by this criterion. The moving average windows of lengths 2-11, and 38-48 were considered important. This means for the regression, \lceil \sqrt{m}\,\rceil = 6.

The minimum node size specified was 5. The “extra trees” splitting rule was used for both variable selection and regression. Up to 10 random trees per predictor were considered. The depth of trees was unbounded. Inbag errors were kept. 100,000 trees were generated for each forest.

(To see larger image, click on figure. It will open in a new window.)

The uncertainties about the predicted trace are difficult to see because they are small. Recall these are prediction errors. To give an idea, the next figure shows a portion of the same with a lens distortion applied:

(To see larger image, click on figure. It will open in a new window.)

Estimates of first and second time derivatives show similarly small prediction error clouds. Note how the first and second derivatives line up with changes in the number of Air Passengers over time in the figure below.

(To see larger image, click on figure. It will open in a new window.)

The following shows a close-up of the same for a limited interval.

(To see larger image, click on figure. It will open in a new window.)

In short, I conclude this technique to be a highly successful approach to estimating derivatives from random forests, even if it is limited to analyzing given series, and not forecasts.

I have changed my mind regarding that last paragraph. The perspective inherent in the regression forest approach above is atypical for many applications of random forests or, for that matter, statistical learning methods. There’s less emphasis upon prediction as a goal, and more on simply characterizing a given dataset. Accordingly, performance of a forest out-of-sample is less important than fidelity to a series of responses in sample. In this respect, random forests are being used very much like spline interpolators and, so, serve as the basis for calculating other numerical aspects of them, such as derivatives.

Also notable is the idea of bootstrapping the forest in order to get a sense of the internal variability in the model. After all, growing the forest from a given dataset does not produce a unique forest. Successive generations of the random forest will differ. Bootstrapping the forest gives some idea how it may have turned out a second time.

This consideration leads to a couple of other questions appropriate for investigation:

  1. What is the equivalence class of forests look like which all reproduce a given series with less than some specified mean-squared-error?
  2. Given such a class, what is the set of series which are classified by members of the class as being close enough to the original series that the forests will reproduce them with less than some specified mean-squared-error, perhaps the same as the m.s.e. associated with the class, or perhaps larger?

Posted in bridge to somewhere, Calculus, dependent data, dynamic generalized linear models, dynamical systems, ensemble methods, ensemble models, filtering, forecasting, hierarchical clustering, linear regression, model-free forecasting, Monte Carlo Statistical Methods, non-mechanistic modeling, non-parametric model, non-parametric statistics, numerical algorithms, prediction, R statistical programming language, random forests, regression, sampling, splines, statistical learning, statistical series, statistics, time derivatives, time series | Leave a comment

Great podcast: “Confronting uncertainty with Tamsin Edwards”

Dr Tamsin Edwards visits Professor David Spiegelhalter on his “Risky Talk” podcast.

Dr Edwards is a climate scientist with the title Senior Lecturer in Physical Geography at Kings College, London.

There’s much good talk about climate and its associated uncertainties, communicating uncertainty in science, and dealing with uncertainty relating to health in your personal life.

Dr Edwards also writes a blog, All Models Are Wrong.

Posted in alternatives to the Green New Deal, American Association for the Advancement of Science, climate change, climate denial, climate education, climate policy, climate science, David Spiegelhalter, dynamical systems, fluid dynamics, games of chance, global warming, global weirding, IPCC, model comparison, risk, Risky Talk, statistical models, statistical series | Leave a comment

Larkin Poe



Posted in Larkin Poe | Leave a comment


Link | Posted on by | Leave a comment

COVID-19 statistics, a caveat : Sources of data matter

There are a number of sources of COVID-19-related demographics, cases, deaths, numbers testing positive, numbers recovered, and numbers testing negative available. Many of these are not consistent with one another. One could hope at least rates would be consistent, but in a moment I’ll show evidence that even those cannot be counted.

The Johns Hopkins Coronavirus Tracker site is a major project displaying COVID-19-related statistics. They also put their datasets and series on a Github site, which is where I draw data for presentation and analyses here. I do that because I’ve read and understood their approach and methodology, and I’m sticking by the same source over time.

There are other sources of data, including, for example, a Github site operated by the New York Times, and the COVID Tracking Project which is operated in association with Atlantic Magazine. The COVID Tracking Project, as far as I know, does not have a Github, but they offer a data portal.

Here I’m going to compare the series of counts of deaths in the United States from the Johns Hopkins Coronavirus Track and those from the COVID Tracking Project. The two don’t have completely consistent coverage: Johns Hopkins reports lag those reported by COVID Tracking, probably because, as I’ve heard in interviews with some of the principals, they want to check and double-check the counts before releasing them. I imagine all data sites undergo some vetting of counts.

But other complexities intervene. There are questions about whose death statistics does a site want to count? Reports from state governments, results released to the U.S. National Center for Health Statistics by coroners, independent assessments from universities, etc. Do you count presumed COVID-19 deaths, or are all deaths a subset of those actually tested and confirmed positive? Do you get counts from the U.S. Centers for Disease Control?

There are no completely correct ways here. The choices are different, and some may be better than others, and, worse, some of the choices may be better than others at different times. Counting deaths from a pandemic, like doing the U.S. Census, can be a politically charged proposition, if not at the federal level, then in U.S. states. Cities might differ in their estimates of their own reported deaths than their states, perhaps because of different counting rules or perhaps because they are using the counts to manage the pandemic and they want them to reflect a local character of their population.

There are also differences in reporting times. It’s long been known that Tuesdays seem to have a blip of deaths cases, and weekends have a paucity, but that cannot be real. It is an artefact of how the tabulation of deaths is done.

But I was surprised that not only does Johns Hopkins differ from the COVID Tracking Project, the disparity grows over time.

So this shows that the COVID Tracking Project exhibits a growing deficit in deaths versus Johns Hopkins over time. But a closer look not only reveals a deficit, it reveals dynamics. Here’s a time plot of Johns Hopkins U.S. death counts less COVID Tracking Project:

A deficit I can understand, but wiggles and bumps are harder to grok.

So in addition to debates about quality of tests, and whether there are excess deaths whose causes were misdiagnosed, there appears to be a major problem with data quality. That’s not to say that any source is wrong, simply that there are few standards on what should be counted how, and how states should count basic things, like cause of death. This not only has implications for COVID-19 policy, but if there is such variation here, one needs to wonder how good standard actuarial determination from the government are, when they rely upon local sources for basic determinations?

There are ways to do this kind of thing when many sources are used. These include weighting sources depending upon long term performance, a bit like Five Thirty Eight does to political opinion polls of Presidential popularity. The other thing to do, of course, would be to do stratified sampling or cluster sampling estimate some of these quantities, or something else (or something else).

Most importantly, it seems key to me that if inferences and forecasts are to be made with these data, the latent number of deaths per unit time needs to be considered a latent function sampled by these complicated processes, and it needs to be recovered first. No doubt that will be done with error, but large scale week-to-week policy should certainly not be predicated upon considering tallies and counts of these important statistics as literally accurate.

They are compiled by well-meaning, intelligent and educated people, but they are, it seems, under the exigencies of this pandemic, very dirty. There’s a lot here going on which isn’t neatly categorized as one of the usual problems seen in count data regression.

Posted in coronavirus, count data regression, COVID-19, descriptive statistics, epidemiology, pandemic, policy metrics, politics, population biology, population dynamics, quantitative biology, quantitative ecology, sampling, SARS-CoV-2, statistical ecology, statistical series, statistics | 1 Comment

First substantial mechanism for long term immunity from SARS-CoV-2 : T-cells

M. Leslie, “T cells found in COVID-19 patients ‘bode well’ for long-term immunity“, Science, doi:10.1126/science.abc8120.

A. Grifoni, et al, “Targets of T cell responses to SARS-CoV-2 coronavirus in humans with COVID-19 disease and unexposed individuals“, Cell, 14th May 2020.

J. Braun, L. Loyal, et al, “Presence of SARS-CoV-2 reactive T cells in COVID-19 patients and healthy donors“, medRχiv (not peer-reviewed, with comments), 22 April 2020.

34% of non-CV19 patients in Germany already had the Coronavirus spike protein specific helper T-cells. Due most likely to infection with prior common cold coronaviruses. That number is close to the percentage of “asymptomatic cases” we are seeing.

Posted in American Association for the Advancement of Science, coronavirus, COVID-19, epidemiology, SARS-CoV-2 | Tagged | Leave a comment

Larkin Poe: “Southern Cross” cover, from CSNY (and extras)

(Theme from Wallender.)

Posted in Larkin Poe | Leave a comment

Dissection of the Dr Judy Mikovits’ claims in AAAS Science


h/t Dr Katharine Hayhoe @LinkedIn

The Chronic Fatigue Syndrome retraction notice.


Science asked Mikovits for an interview for this article. She responded by sending an empty email with, as attachments, a copy of her new book and a PowerPoint of a 2019 presentation titled “Persecution and Coverup.”

Below are some of the video’s main claims and allegations, along with the facts.

Interviewer: Dr. Judy Mikovits has been called one of the most accomplished scientists of her generation.

Mikovits had authored 40 scientific papers and wasn’t widely known in the scientific community before she published the 2009 Science paper claiming a link between a new retrovirus and CFS. The paper was later proven erroneous and retracted.

Interviewer: Her 1991 doctoral thesis revolutionized the treatment of HIV/AIDS.

Mikovits’s Ph.D. thesis, “Negative Regulation of HIV Expression in Monocytes,” had no discernible impact on the treatment of HIV/AIDS.

Interviewer: At the height of her career, Dr. Mikovits published a blockbuster article in the journal Science. The controversial article sent shock waves through the scientific community, as it revealed that the common use of animal and human fetal tissues was unleashing devastating plagues of chronic diseases.

The paper revealed nothing of the sort; it only claimed to show a link between one condition, CFS, and a mouse retrovirus.


Mikovits: Heads of our entire HHS [Department of Health and Human Services] colluded and destroyed my reputation and the Department of Justice and the [Federal Bureau of Investigation] sat on it, and kept that case under seal.

Mikovits has presented no direct evidence that HHS heads colluded against her.

Mikovits: [Fauci] directed the cover-up. And in fact, everybody else was paid off, and paid off big time, millions of dollars in funding from Tony Fauci and … the National Institute of Allergy and Infectious Diseases. These investigators that committed the fraud, continue to this day to be paid big time by the NIAID.

It’s not clear which fraud and what cover-up Mikovits is talking about exactly. There is no evidence that Fauci was involved in a cover-up or that anyone was paid off with funding from him or his institute. No one has been charged with fraud in relation to Mikovits’s allegations.

Mikovits: It started really when I was 25 years old, and I was part of the team that isolated HIV from the saliva and blood of the patients from France where [virologist Luc] Montagnier had originally isolated the virus. … Fauci holds up the publication of the paper for several months while Robert Gallo writes his own paper and takes all the credit, and of course patents are involved. This delay of the confirmation, you know, literally led to spreading the virus around, you know, killing millions.

At the time of HIV’s discovery, Mikovits was a lab technician in Francis Ruscetti’s lab at NCI and had yet to receive her Ph.D. There is no evidence that she was part of the team that first isolated the virus. Her first published paper, co-authored with Ruscetti, was on HIV and published in May 1986, 2 years after Science published four landmark papers that linked HIV (then called HTLV-III by Gallo’s lab) to AIDS. Ruscetti’s first paper on HIV appeared in August 1985. There is no evidence that Fauci held up either paper or that this led to the death of millions.

Interviewer: If we activate mandatory vaccines globally, I imagine these people stand to make hundreds of billions of dollars that own the vaccines.

Mikovits: And they’ll kill millions, as they already have with their vaccines. There is no vaccine currently on the schedule for any RNA virus that works.

Vaccines have not killed millions; they have saved millions of lives. Many vaccines that work against RNA viruses are on the market, including for influenza, measles, mumps, rubella, rabies, yellow fever, and Ebola.


Interviewer: Do you believe that this virus [SARS-CoV-2] was created in the laboratory?

Mikovits: I wouldn’t use the word created. But you can’t say naturally occurring if it was by way of the laboratory. So it’s very clear this virus was manipulated. This family of viruses was manipulated and studied in a laboratory where the animals were taken into the laboratory, and this is what was released, whether deliberate or not. That cannot be naturally occurring. Somebody didn’t go to a market, get a bat, the virus didn’t jump directly to humans. That’s not how it works. That’s accelerated viral evolution. If it was a natural occurrence, it would take up to 800 years to occur.

Scientific estimates suggest the closest virus to SARS-CoV-2, the virus that causes COVID-19, is a bat coronavirus identified by the Wuhan Institute of Virology (WIV). Its “distance” in evolutionary time to SARS-CoV-2 is about 20 to 80 years. There is no evidence this bat virus was manipulated.

Interviewer: And do you have any ideas of where this occurred?

Mikovits: Oh yeah, I’m sure it occurred between the North Carolina laboratories, Fort Detrick, the U.S. Army Medical Research Institute of Infectious Diseases, and the Wuhan laboratory.

There is no evidence that SARS-CoV-2 originated at WIV. NIAID’s funding of a U.S. group that works with the Wuhan lab has been stopped, which outraged many scientists.

Mikovits: Italy has a very old population. They’re very sick with inflammatory disorders. They got at the beginning of 2019 an untested new form of influenza vaccine that had four different strains of influenza, including the highly pathogenic H1N1. That vaccine was grown in a cell line, a dog cell line. Dogs have lots of coronaviruses.

There is no evidence that links any influenza vaccine, or a dog coronavirus, to Italy’s COVID-19 epidemic.

Mikovits: Wearing the mask literally activates your own virus. You’re getting sick from your own reactivated coronavirus expressions, and if it happens to be SARS-CoV-2, then you’ve got a big problem.

It’s not clear what Mikovits means by “coronavirus expressions.” There is no evidence that wearing a mask can activate viruses and make people sick.

Mikovits: Why would you close the beach? You’ve got sequences in the soil, in the sand. You’ve got healing microbes in the ocean in the salt water. That’s insanity.

It’s not clear what Mikovits means by sand or soil “sequences.” There is no evidence that microbes in the ocean can heal COVID-19 patients.

See the remainder of the article at Science.

Posted in American Association for the Advancement of Science, coronavirus, COVID-19, SARS-CoV-2, science, Science magazine | 3 Comments

We are all Mathematicians

Bobby Seagull.


Posted in mathematics, mathematics education, maths | Leave a comment

“There’s mourning in America”

We are Republicans and we want Trump defeated.”

And the Orange Mango apparently hates this advert.

And that’s why it’s here.

The Lincoln Project apparently introduced this advert on Twitter with the explanatory text:

Since you are awake and trolling the internet, @realDonaldTrump, here is a little bedtime story just for you … good night, Mr. President.#Sleeptight #CountryOverParty

Posted in statistics | Leave a comment