COVID-19 statistics, a caveat: Sources of data matter

Posted on 17 May 2020 by ecoquant

There are a number of sources of COVID-19-related demographics, cases, deaths, numbers testing positive, numbers recovered, and numbers testing negative available. Many of these are not consistent with one another. One could hope that at least rates would be consistent, but in a moment I’ll show evidence that even those cannot be counted on.

The Johns Hopkins Coronavirus Tracker site is a major project displaying COVID-19-related statistics. They also put their datasets and series on a Github site, which is where I draw data for presentation and analyses here. I do that because I’ve read and understood their approach and methodology, and I’m sticking by the same source over time.

There are other sources of data, including, for example, a Github site operated by the New York Times, and the COVID Tracking Project, which is operated in association with The Atlantic. The COVID Tracking Project, as far as I know, does not have a Github, but they offer a data portal.

Here I’m going to compare the series of counts of deaths in the United States from the Johns Hopkins Coronavirus Tracker and those from the COVID Tracking Project. The two don’t have completely consistent coverage: Johns Hopkins reports lag those reported by COVID Tracking, probably because, as I’ve heard in interviews with some of the principals, they want to check and double-check the counts before releasing them. I imagine all data sites undergo some vetting of counts.

But other complexities intervene. There is the question of whose death statistics a site wants to count: reports from state governments, results released to the U.S. National Center for Health Statistics by coroners, independent assessments from universities, and so on. Do you count presumed COVID-19 deaths, or only deaths among those actually tested and confirmed positive? Do you get counts from the U.S. Centers for Disease Control?

There is no completely correct way to do this. The choices are different, and some may be better than others, and, worse, some of the choices may be better than others at different times. Counting deaths from a pandemic, like doing the U.S. Census, can be a politically charged proposition, if not at the federal level, then in U.S. states. Cities might differ from their states in their estimates of their own reported deaths, perhaps because of different counting rules or perhaps because they are using the counts to manage the pandemic and they want them to reflect a local character of their population.

There are also differences in reporting times. It’s long been known that Tuesdays seem to have a blip of deaths, and weekends a paucity, but that cannot be real. It is an artefact of how the tabulation of deaths is done.
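One quick way to see a day-of-week artefact like this is to average daily counts by weekday. The following is only a sketch: the function name `mean_by_weekday` and the numbers are made up for illustration, not drawn from either data source.

```python
from collections import defaultdict
from datetime import date, timedelta

WEEKDAYS = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]

def mean_by_weekday(daily):
    """Average the daily counts separately for each day of the week."""
    buckets = defaultdict(list)
    for day, count in daily.items():
        buckets[WEEKDAYS[day.weekday()]].append(count)
    return {name: sum(v) / len(v) for name, v in buckets.items()}

# Made-up daily death counts for two weeks starting Monday 4 May 2020.
# These are NOT real data; they only mimic a Tuesday bump and weekend dip.
pattern = [100, 140, 110, 105, 100, 70, 60]
start = date(2020, 5, 4)
daily = {start + timedelta(days=i): pattern[i % 7] for i in range(14)}

means = mean_by_weekday(daily)
for name in WEEKDAYS:
    print(name, means[name])
```

If the weekday means differ a lot while nothing biological could plausibly explain it, the pattern is coming from the reporting process.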

But I was surprised that not only does Johns Hopkins differ from the COVID Tracking Project, the disparity grows over time.

So this shows that the COVID Tracking Project exhibits a growing deficit in deaths versus Johns Hopkins over time. But a closer look not only reveals a deficit, it reveals dynamics. Here’s a time plot of Johns Hopkins U.S. death counts less COVID Tracking Project:

A deficit I can understand, but wiggles and bumps are harder to grok.
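The difference series itself is simple to form once the two sources are aligned by date. Here is a minimal sketch, assuming each source has already been reduced to a mapping from date to cumulative U.S. deaths. The names `series_difference`, `jhu` and `ctp` and all the numbers are hypothetical, for illustration only.

```python
from datetime import date

def series_difference(a, b):
    """Return {date: a[date] - b[date]} for dates present in both series."""
    common = sorted(set(a) & set(b))
    return {d: a[d] - b[d] for d in common}

# Made-up cumulative counts (NOT real data). The second series covers one
# more day than the first, to show that only shared dates are compared.
jhu = {date(2020, 5, 1): 100, date(2020, 5, 2): 130, date(2020, 5, 3): 165}
ctp = {date(2020, 5, 1): 95, date(2020, 5, 2): 120,
       date(2020, 5, 3): 150, date(2020, 5, 4): 170}

gap = series_difference(jhu, ctp)
for d, g in gap.items():
    print(d.isoformat(), g)
```

Restricting to shared dates matters here because the two sources do not have the same coverage.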

So in addition to debates about quality of tests, and whether there are excess deaths whose causes were misdiagnosed, there appears to be a major problem with data quality. That’s not to say that any source is wrong, simply that there are few standards on what should be counted how, and how states should count basic things, like cause of death. This not only has implications for COVID-19 policy, but if there is such variation here, one needs to wonder how good standard actuarial determinations from the government are, when they rely upon local sources for basic determinations.

There are ways to do this kind of thing when many sources are used. These include weighting sources depending upon long-term performance, a bit like Five Thirty Eight does with political opinion polls of Presidential popularity. The other thing to do, of course, would be to use stratified sampling or cluster sampling to estimate some of these quantities, or something else.
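As a sketch of the weighting idea, one simple scheme is an inverse-variance weighted average, where each source's weight is the reciprocal of its historical error variance. This is my own illustrative stand-in, not a description of what Five Thirty Eight actually does, and the source names, counts and variances below are invented.

```python
def weighted_estimate(reports, error_variance):
    """Combine per-source counts, weighting each by 1 / its error variance."""
    weights = {s: 1.0 / error_variance[s] for s in reports}
    total = sum(weights.values())
    return sum(weights[s] * reports[s] for s in reports) / total

# Hypothetical sources with made-up counts and made-up error variances.
# A source with a smaller variance (better track record) counts for more.
reports = {"source_a": 100.0, "source_b": 110.0}
error_variance = {"source_a": 1.0, "source_b": 4.0}

estimate = weighted_estimate(reports, error_variance)
print(estimate)
```

The hard part in practice is not the arithmetic but deciding what "long-term performance" means when there is no ground truth to score against.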

Most importantly, it seems key to me that if inferences and forecasts are to be made with these data, the true number of deaths per unit time needs to be treated as a latent function sampled by these complicated processes, and it needs to be recovered first. No doubt that will be done with error, but large-scale week-to-week policy should certainly not be predicated upon considering tallies and counts of these important statistics as literally accurate.
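To give a flavor of what "recovering" might look like, here is about the crudest possible version: a centered seven-day moving average of daily counts, which cancels a purely weekly reporting cycle. It is a stand-in, not a recommendation. A serious attempt would use something like a state-space model or a penalized spline with explicit reporting effects. The function name and the numbers are invented.

```python
def centered_moving_average(values, window=7):
    """Average each point with its neighbors; windows shrink at the edges."""
    half = window // 2
    smoothed = []
    for i in range(len(values)):
        lo = max(0, i - half)
        hi = min(len(values), i + half + 1)
        smoothed.append(sum(values[lo:hi]) / (hi - lo))
    return smoothed

# Made-up daily counts (NOT real data): a flat underlying level distorted
# by a weekly reporting cycle, repeated for three weeks.
pattern = [100, 140, 110, 105, 100, 70, 60]
daily = pattern * 3

smoothed = centered_moving_average(daily)
print([round(x, 1) for x in smoothed])
```

Away from the two ends, every window covers exactly one full week, so the weekly cycle averages out and the smoothed values sit at the underlying level. Near the ends the windows are truncated and the estimates are less trustworthy.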

They are compiled by well-meaning, intelligent and educated people, but they are, it seems, under the exigencies of this pandemic, very dirty. There’s a lot going on here which isn’t neatly categorized as one of the usual problems seen in count data regression.

About ecoquant

See https://wordpress.com/view/667-per-cm.net/ Retired data scientist and statistician. Now working projects in quantitative ecology and, specifically, phenology of Bryophyta and technical methods for their study, notably Macrophotography. Some photos of mine: https://www.flickr.com/photos/198372469@N03/
