## COVID-19 statistics, a caveat : Sources of data matter

There are a number of sources of COVID-19-related demographics, cases, deaths, numbers testing positive, numbers recovered, and numbers testing negative available. Many of these are not consistent with one another. One could hope at least rates would be consistent, but in a moment I’ll show evidence that even those cannot be counted on.

The Johns Hopkins Coronavirus Tracker site is a major project displaying COVID-19-related statistics. They also put their datasets and series on a Github site, which is where I draw data for presentation and analyses here. I do that because I’ve read and understood their approach and methodology, and I’m sticking by the same source over time.

There are other sources of data, including, for example, a Github site operated by the New York Times, and the COVID Tracking Project which is operated in association with Atlantic Magazine. The COVID Tracking Project, as far as I know, does not have a Github, but they offer a data portal.

Here I’m going to compare the series of counts of deaths in the United States from the Johns Hopkins Coronavirus Tracker and those from the COVID Tracking Project. The two don’t have completely consistent coverage: Johns Hopkins reports lag those reported by COVID Tracking, probably because, as I’ve heard in interviews with some of the principals, they want to check and double-check the counts before releasing them. I imagine all data sites undergo some vetting of counts.

But other complexities intervene. There are questions about whose death statistics a site wants to count: reports from state governments, results released to the U.S. National Center for Health Statistics by coroners, independent assessments from universities, and so on. Do you count presumed COVID-19 deaths, or only deaths among those actually tested and confirmed positive? Do you get counts from the U.S. Centers for Disease Control?

There are no completely correct ways here. The choices are different, and some may be better than others, and, worse, some of the choices may be better than others at different times. Counting deaths from a pandemic, like doing the U.S. Census, can be a politically charged proposition, if not at the federal level, then in U.S. states. Cities might differ from their states in their estimates of their own reported deaths, perhaps because of different counting rules or perhaps because they are using the counts to manage the pandemic and they want them to reflect a local character of their population.

There are also differences in reporting times. It’s long been known that Tuesdays seem to have a blip in reported deaths, and weekends have a paucity, but that cannot be real. It is an artefact of how the tabulation of deaths is done.

But I was surprised that not only does Johns Hopkins differ from the COVID Tracking Project, the disparity grows over time.

So this shows that the COVID Tracking Project exhibits a growing deficit in deaths versus Johns Hopkins over time. But a closer look not only reveals a deficit, it reveals dynamics. Here’s a time plot of Johns Hopkins U.S. death counts less COVID Tracking Project:

A deficit I can understand, but wiggles and bumps are harder to grok.
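For anyone wanting to reproduce this kind of comparison, here is a minimal sketch. It assumes each source has already been reduced to a mapping from date to cumulative U.S. deaths. The function names and the toy numbers are mine, made up for illustration; they are not the actual Johns Hopkins or COVID Tracking Project figures.

```python
from datetime import date


def series_gap(source_a, source_b):
    """Return {date: a - b} for the dates present in both cumulative series."""
    common = sorted(set(source_a) & set(source_b))
    return {d: source_a[d] - source_b[d] for d in common}


def daily_change(gap):
    """First differences of the gap, keyed by the later date of each adjacent pair."""
    days = sorted(gap)
    return {later: gap[later] - gap[earlier] for earlier, later in zip(days, days[1:])}


# Toy cumulative counts (hypothetical, for illustration only).
jhu = {date(2020, 4, 1): 100, date(2020, 4, 2): 130, date(2020, 4, 3): 170}
ctp = {date(2020, 4, 1): 95, date(2020, 4, 2): 120, date(2020, 4, 3): 150}

gap = series_gap(jhu, ctp)
print(gap)                # the deficit itself
print(daily_change(gap))  # the "wiggles": how the deficit moves day to day
```

Only dates reported by both sources are compared, so a lag in one source shortens the comparison rather than producing spurious gaps.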

So in addition to debates about quality of tests, and whether there are excess deaths whose causes were misdiagnosed, there appears to be a major problem with data quality. That’s not to say that any source is wrong, simply that there are few standards on what should be counted how, and how states should count basic things, like cause of death. This not only has implications for COVID-19 policy, but if there is such variation here, one needs to wonder how good standard actuarial determinations from the government are, when they rely upon local sources for basic determinations.

There are ways to do this kind of thing when many sources are used. These include weighting sources depending upon long term performance, a bit like Five Thirty Eight does to political opinion polls of Presidential popularity. The other thing to do, of course, would be to use stratified sampling or cluster sampling to estimate some of these quantities, or something else.
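To make the weighting idea concrete, here is a small sketch of one simple scheme: weight each source by the inverse of its mean squared error over some past period, then take the weighted average of the current estimates. This is my own illustrative choice, not a description of what Five Thirty Eight actually does, and the source names and numbers are hypothetical.

```python
def performance_weights(past_errors):
    """Weight each source by the inverse of its mean squared past error.

    Assumes every source has at least one nonzero past error.
    """
    inverse_mse = {}
    for source, errors in past_errors.items():
        mse = sum(e * e for e in errors) / len(errors)
        inverse_mse[source] = 1.0 / mse
    total = sum(inverse_mse.values())
    return {source: w / total for source, w in inverse_mse.items()}


def combine(estimates, weights):
    """Weighted average of the current estimates from each source."""
    return sum(weights[s] * estimates[s] for s in estimates)


# Hypothetical past errors (reported count minus later-revised count).
past_errors = {"source_a": [1.0, -1.0, 1.0], "source_b": [2.0, -2.0, 2.0]}
weights = performance_weights(past_errors)
estimate = combine({"source_a": 100.0, "source_b": 110.0}, weights)
print(weights, estimate)
```

The source with the smaller historical error dominates the combined figure, which is the whole point of the exercise.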

Most importantly, it seems key to me that if inferences and forecasts are to be made with these data, the latent number of deaths per unit time needs to be considered a latent function sampled by these complicated processes, and it needs to be recovered first. No doubt that will be done with error, but large scale week-to-week policy should certainly not be predicated upon considering tallies and counts of these important statistics as literally accurate.

They are compiled by well-meaning, intelligent and educated people, but they are, it seems, under the exigencies of this pandemic, very dirty. There’s a lot here going on which isn’t neatly categorized as one of the usual problems seen in count data regression.

## First substantial mechanism for long term immunity from SARS-CoV-2 : T-cells

M. Leslie, “T cells found in COVID-19 patients ‘bode well’ for long-term immunity“, Science, doi:10.1126/science.abc8120.

A. Grifoni, et al, “Targets of T cell responses to SARS-CoV-2 coronavirus in humans with COVID-19 disease and unexposed individuals“, Cell, 14th May 2020.

J. Braun, L. Loyal, et al, “Presence of SARS-CoV-2 reactive T cells in COVID-19 patients and healthy donors“, medRχiv (not peer-reviewed, with comments), 22 April 2020.

## Dissection of the Dr Judy Mikovits’ claims in AAAS Science

https://www.sciencemag.org/news/2020/05/fact-checking-judy-mikovits-controversial-virologist-attacking-anthony-fauci-viral

Excerpt:

Science asked Mikovits for an interview for this article. She responded by sending an empty email with, as attachments, a copy of her new book and a PowerPoint of a 2019 presentation titled “Persecution and Coverup.”

Below are some of the video’s main claims and allegations, along with the facts.

Interviewer: Dr. Judy Mikovits has been called one of the most accomplished scientists of her generation.

Mikovits had authored 40 scientific papers and wasn’t widely known in the scientific community before she published the 2009 Science paper claiming a link between a new retrovirus and CFS. The paper was later proven erroneous and retracted.

Interviewer: Her 1991 doctoral thesis revolutionized the treatment of HIV/AIDS.

Mikovits’s Ph.D. thesis, “Negative Regulation of HIV Expression in Monocytes,” had no discernible impact on the treatment of HIV/AIDS.

Interviewer: At the height of her career, Dr. Mikovits published a blockbuster article in the journal Science. The controversial article sent shock waves through the scientific community, as it revealed that the common use of animal and human fetal tissues was unleashing devastating plagues of chronic diseases.

The paper revealed nothing of the sort; it only claimed to show a link between one condition, CFS, and a mouse retrovirus.

.
.
.

Mikovits: Heads of our entire HHS [Department of Health and Human Services] colluded and destroyed my reputation and the Department of Justice and the [Federal Bureau of Investigation] sat on it, and kept that case under seal.

Mikovits has presented no direct evidence that HHS heads colluded against her.

Mikovits: [Fauci] directed the cover-up. And in fact, everybody else was paid off, and paid off big time, millions of dollars in funding from Tony Fauci and … the National Institute of Allergy and Infectious Diseases. These investigators that committed the fraud, continue to this day to be paid big time by the NIAID.

It’s not clear which fraud and what cover-up Mikovits is talking about exactly. There is no evidence that Fauci was involved in a cover-up or that anyone was paid off with funding from him or his institute. No one has been charged with fraud in relation to Mikovits’s allegations.

Mikovits: It started really when I was 25 years old, and I was part of the team that isolated HIV from the saliva and blood of the patients from France where [virologist Luc] Montagnier had originally isolated the virus. … Fauci holds up the publication of the paper for several months while Robert Gallo writes his own paper and takes all the credit, and of course patents are involved. This delay of the confirmation, you know, literally led to spreading the virus around, you know, killing millions.

At the time of HIV’s discovery, Mikovits was a lab technician in Francis Ruscetti’s lab at NCI and had yet to receive her Ph.D. There is no evidence that she was part of the team that first isolated the virus. Her first published paper, co-authored with Ruscetti, was on HIV and published in May 1986, 2 years after Science published four landmark papers that linked HIV (then called HTLV-III by Gallo’s lab) to AIDS. Ruscetti’s first paper on HIV appeared in August 1985. There is no evidence that Fauci held up either paper or that this led to the death of millions.

Interviewer: If we activate mandatory vaccines globally, I imagine these people stand to make hundreds of billions of dollars that own the vaccines.

Mikovits: And they’ll kill millions, as they already have with their vaccines. There is no vaccine currently on the schedule for any RNA virus that works.

Vaccines have not killed millions; they have saved millions of lives. Many vaccines that work against RNA viruses are on the market, including for influenza, measles, mumps, rubella, rabies, yellow fever, and Ebola.

.
.
.

Interviewer: Do you believe that this virus [SARS-CoV-2] was created in the laboratory?

Mikovits: I wouldn’t use the word created. But you can’t say naturally occurring if it was by way of the laboratory. So it’s very clear this virus was manipulated. This family of viruses was manipulated and studied in a laboratory where the animals were taken into the laboratory, and this is what was released, whether deliberate or not. That cannot be naturally occurring. Somebody didn’t go to a market, get a bat, the virus didn’t jump directly to humans. That’s not how it works. That’s accelerated viral evolution. If it was a natural occurrence, it would take up to 800 years to occur.

Scientific estimates suggest the closest virus to SARS-CoV-2, the virus that causes COVID-19, is a bat coronavirus identified by the Wuhan Institute of Virology (WIV). Its “distance” in evolutionary time to SARS-CoV-2 is about 20 to 80 years. There is no evidence this bat virus was manipulated.

Interviewer: And do you have any ideas of where this occurred?

Mikovits: Oh yeah, I’m sure it occurred between the North Carolina laboratories, Fort Detrick, the U.S. Army Medical Research Institute of Infectious Diseases, and the Wuhan laboratory.

There is no evidence that SARS-CoV-2 originated at WIV. NIAID’s funding of a U.S. group that works with the Wuhan lab has been stopped, which outraged many scientists.

Mikovits: Italy has a very old population. They’re very sick with inflammatory disorders. They got at the beginning of 2019 an untested new form of influenza vaccine that had four different strains of influenza, including the highly pathogenic H1N1. That vaccine was grown in a cell line, a dog cell line. Dogs have lots of coronaviruses.

There is no evidence that links any influenza vaccine, or a dog coronavirus, to Italy’s COVID-19 epidemic.

Mikovits: Wearing the mask literally activates your own virus. You’re getting sick from your own reactivated coronavirus expressions, and if it happens to be SARS-CoV-2, then you’ve got a big problem.

It’s not clear what Mikovits means by “coronavirus expressions.” There is no evidence that wearing a mask can activate viruses and make people sick.

Mikovits: Why would you close the beach? You’ve got sequences in the soil, in the sand. You’ve got healing microbes in the ocean in the salt water. That’s insanity.

It’s not clear what Mikovits means by sand or soil “sequences.” There is no evidence that microbes in the ocean can heal COVID-19 patients.

See the remainder of the article at Science.

Bobby Seagull.

Great.

## “We are Republicans and we want Trump defeated.”

### And that’s why it’s here.

The Lincoln Project apparently introduced this advert on Twitter with the explanatory text:

Since you are awake and trolling the internet, @realDonaldTrump, here is a little bedtime story just for you … good night, Mr. President.#Sleeptight #CountryOverParty

## “Seasonality of COVID-19, Other Coronaviruses, and Influenza” (from Radford Neal’s blog)

Thorough review with documentation and technical criticism of claims of COVID-19 seasonality or its lack. Whichever way this comes down, the links are well worth the visit!

Will the incidence of COVID-19 decrease in the summer? There is reason to hope that it will, since in temperate climates influenza and the four coronaviruses that are among the causes of the “common cold” do follow a seasonal pattern, with many fewer cases in the summer. If COVID-19 is affected by season, this would […]

## Phase plane plots of COVID-19 deaths

There are many ways of presenting analytical summaries of new series data for which the underlying mechanisms are incompletely understood. With respect to series describing the COVID-19 pandemic, Tamino has used piecewise linear models. I have mentioned how I preferred penalized (regression) splines. I intended to illustrate something comparable with what Tamino does (see also), but, then, I thought I could do both that and expand the discussion to include a kind of presentation used in functional data analysis, namely phase plane plots. Be sure to visit that last link for an illustration of what the curves below mean.

Here I’ll look at various series describing regional characteristics of the COVID-19 pandemic. These include counts of deaths, counts of cases, and counts of number of recovered people. My data source is exclusively the Johns Hopkins Coronavirus Resource Center and, in particular, their Github repository. I have also examined a similar repository maintained by The New York Times, but I find it not as well curated, particularly when it admits the most recent data, data which sometimes exhibits wild swings.

That said, and while curation is important, as well as data validation, no organization can do much when the reporting agencies do not provide counts uniformly or backfill counts to their proper dates. Thus, if a large count of deaths from COVID-19 is identified, the proper procedure is to tag each death with an estimate of the day of death, and correct figures for deaths on those days. Instead, some government agencies appear to dump the discovered counts in all on the present day. There is evidence as well that some agencies hold on to counts until they reach some number, and then release the counts all at once. These are important because this sampling or reporting policy will manifest as changes in dynamics of the disease when, in fact, it is nothing of the kind.
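A toy illustration of why this matters, using entirely made-up numbers: suppose deaths run at a steady 10 per day, and 12 further deaths that actually occurred on days 1 through 3 are only identified on day 5. Backfilling them to their proper dates and dumping them on the day of discovery give the same total but very different daily series.

```python
def tally(deaths_by_day, n_days):
    """Sum (day_index, count) records into a list of daily counts."""
    counts = [0] * n_days
    for day, n in deaths_by_day:
        counts[day] += n
    return counts


# Hypothetical records: a steady 10 deaths per day for 6 days.
routine = [(d, 10) for d in range(6)]
# 12 late-identified deaths, 4 on each of days 1, 2 and 3.
late = [(1, 4), (2, 4), (3, 4)]

backfilled = tally(routine + late, 6)    # tagged with estimated day of death
dumped = tally(routine + [(5, 12)], 6)   # all added on the day of discovery

print(backfilled)
print(dumped)
```

The dumped series shows a flat rate followed by an apparent surge on the last day, which a naive reading would take as a change in the dynamics of the disease.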

I’m not saying this kind of distortion is deliberate, although, in some cases, it could be. (We cannot tell if it’s deliberate evasion or bureaucratic rigidity, or simply they-don’t-care-as-long-as-they-report.) It is indicative of overworked, fatigued demographers, epidemiologists, and public health professionals making a best effort to provide accurate counts. Johns Hopkins makes an effort to try to resolve some of these changes. But it cannot do everything.

This idea of having real time reporting is an essential part of mounting a proper pandemic response and is, at least in the United States, another thing for which we were woefully underprepared. Without it, even the highest government officials are flying low to the ground with a fogged up window, so to speak.

Nevertheless, I have accepted the Johns Hopkins data as they are and analyzed them in the manner described below. I’ll comment about some features the plots show, some of which pertain to hints about how data is collected and reported.

I’m very much in favor of avoiding use of absolute counts. We don’t know how comprehensive those are, so, I prefer to look at rates of change. Professor David Spiegelhalter discusses why rates are a good idea in connection with COVID-19 reporting.

Note that the data used only goes through the 29th of April 2020.

## Approach

The thing about tracking counts of deaths is that, even if they are complete, which in the middle of a crisis they seldom are, these counts don’t tell you much. What do you want to know? Ultimately, you want to know how effective a set of policies is at curtailing deaths due to COVID-19. So, for instance, if the cumulative counts of deaths for Germany are considered:

this doesn’t really say a lot about what’s driving the shape of the curve. A phase plane plot is constructed by calculating good estimates of the time series’ first and second (time) derivatives at each point, and then plotting the second derivative against the first. The first derivative is the rate at which people are dying, and the second derivative is the change per unit time in that rate, just like acceleration or deceleration of a car, where the second time derivative being positive means acceleration, and its being negative means deceleration.
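The construction can be sketched in a few lines. Note that this is a crude stand-in and not the code behind the plots here: it smooths with a centered moving average and differentiates by finite differences, where the plots in this post rely on penalized splines. The function names and the test series are assumptions for illustration.

```python
def moving_average(values, window=7):
    """Centered moving average; the window shrinks near the ends of the series."""
    half = window // 2
    smoothed = []
    for i in range(len(values)):
        lo, hi = max(0, i - half), min(len(values), i + half + 1)
        smoothed.append(sum(values[lo:hi]) / (hi - lo))
    return smoothed


def derivative(values):
    """Central differences per day, one-sided at the two ends (needs >= 2 points)."""
    n = len(values)
    out = []
    for i in range(n):
        lo, hi = max(0, i - 1), min(n - 1, i + 1)
        out.append((values[hi] - values[lo]) / (hi - lo))
    return out


def phase_plane(cumulative, window=7):
    """Return one (rate, acceleration) pair per day from cumulative counts."""
    rate = derivative(moving_average(cumulative, window))
    acceleration = derivative(rate)
    return list(zip(rate, acceleration))


# Hypothetical cumulative series t**2: rate should be 2t, acceleration 2.
cumulative = [t * t for t in range(30)]
points = phase_plane(cumulative)
print(points[15])
```

Plotting the second element of each pair (vertical axis) against the first (horizontal axis), and joining the points in date order, gives the phase plane curve. Values near the ends of the series are less trustworthy because the smoothing window and the differences are one-sided there.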

From the perspective of policy, if the rate is appreciably positive, you want policy measures which decelerate, driving that rate down. A win is when both first and second derivatives are near zero. So, for Germany,

An interpretation is that Germany began taking the matter seriously about 25th-28th March, and activated a bunch of measures on the 31st of March, and then has struggled to decelerate rates of death. The loops denote oscillations where a policy is implemented and, for some reason, there’s a reaction, whether in government, or public, which accelerates the rate again. This tug of war has, on average, kept the rate of death at about 200 deaths a day, but it’s not going down.

A caution here, however: These dates of action should be interpreted with care. A death occurs about 10-14 days after infection. So, if a policy action is taken, the effects of that action won’t be seen for 2-3 weeks. So, rather than saying Germany took it seriously 25th March, I should have said 3rd of March, and actions were implemented about the 10th of March.
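The date adjustment is just a subtraction, sketched below with a hypothetical helper. The 2 to 3 week lag is the assumption stated above, not something estimated from the data.

```python
from datetime import date, timedelta


def shift_for_lag(observed, lag_days):
    """Date an action would have been taken for its effect to show on `observed`."""
    return observed - timedelta(days=lag_days)


# Change in the death series visible around the 31st of March.
observed = date(2020, 3, 31)
earliest = shift_for_lag(observed, 21)  # assuming a 3-week lag
latest = shift_for_lag(observed, 14)    # assuming a 2-week lag
print(earliest, latest)
```

With a 3-week lag the 31st of March maps back to the 10th of March, which is the figure used in the paragraph above.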

Can it be managed as I say, driving both acceleration and rates to zero? Sure, consider Taiwan:

The disease got a little away from them in early March, and then again but to a smaller degree in early April, but now it’s controlled. Note, however, that the number of cases per day was never permitted to get above twenty.

## A Review of Some Countries

So, what about the United States? It’s actually not too bad, except that it’s clear control measures have not been anywhere as stringent as the couple of countries already mentioned. There are no loops:

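Sure, it’s good the rate is decelerating, but it’s not decelerating by much: Less than 50 deaths per day, per day. That’s when the rate is just under 2000 deaths per day. Even if that deceleration held constant, it would take roughly 40 days or more for the rate to reach zero.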

What about the biggest contributor, New York State?

It’s clear New York is really struggling to get control of the pandemic, holding the rate to about 700 deaths a day, but it was accelerating again as of reporting on the 29th of April. Those cycles mean action, however.

What about Sweden? They’ve been touted as not having any lockdowns. I’d say Sweden is in trouble:

But that judgment rests on basically just a couple of datapoints. Perhaps they are askew, and the sharp upward turn isn’t real. After all, they seem to have managed to keep the rate of death to about 70 per day on average, with a wide scatter.

## The Scene from (Some of) the States

Finally, let’s look at some states in the United States.

Consider Michigan, for instance, scene of much conflict over the lockdown measures their governor has taken:

Michigan has struggled as well, but whatever was done about 10th April or so really slammed the brakes on numbers of deaths.

Florida is interesting because although the number of deaths is (reported as) 50 per day, there is little evidence of acceleration or deceleration.

They have 1200 deaths. An interpretation is that it’s still early in Florida. Another is that all the deaths have not been reported yet. Another interpretation is that for whatever reason people dying of COVID-19 are being given a cause-of-death which is something else. There are a lot of elderly people in Florida. Perhaps a comparison of overall mortality rates there with historical would be advised. Note also that the overall shape of the acceleration vs rate of death in Florida is similar to the United States at large, although the magnitudes are different.

Finally, consider Georgia, another state where policy on managing COVID-19 has been contentious:

Despite the contention, it’s clear Georgia has been doing something effective to contain the disease. The trouble is that the latest data suggests it is beginning to get away from them, although it’s early to say if that will continue, or they will find a way to bring it back.

#### Update, 3rd May 2020

I will be updating these results regularly. I also intend to drape a varying uncertainty area calculated from uncertainties in estimating acceleration and velocity.

The R code used to generate these is still in progress, but when it matures a bit I will make it publicly available.

## A SimCity for the Climate

SimCity is/was a classic simulation game teaching basics of public policy, energy management, and environmental regulation. My kids played it a lot. Heck, I played it a lot.

Now, Climate Interactive, Tom Fiddaman of Ventana Systems, Prof John Sterman of MIT Sloan, and Prof Juliette Rooney-Varga of UMass Lowell’s Climate Change Initiative have teamed up to produce En-Roads, a climate change solutions simulator.

In addition, Bloomberg Green has adapted the simulator with a simpler menu interface to allow you to try some things yourself.

And check out Bloomberg Green‘s climate data dashboard.

## Simplistic and Dangerous Models

Nice to see Generalized Additive Models used.

A few weeks ago there were none. Three weeks ago, with an entirely inadequate search strategy, ten cases were found. Last Saturday there were 43! With three inaccurate data points, there is enough information to fit an exponential curve: the prevalence is doubling every seven days. Armchair epidemiologists should start worrying that by Christmas there will be 10¹² preprints relating COVID-19 to weather and climate unless an antidote is found.

Fortunately, a first version of an antidote to one form of the preprint plague that is sweeping the planet known as the SDM (apparently this not an acronym for Simplistic and Dangerous Model). A second version is due to be published soon.

So why are SDMs such a bad tool for trying to model the spread of COVID-19?

### 1) The system is not at equilibrium

The first case of what has become known as COVID-19 was reported on

View original post 717 more words

## Major Ocean Currents Drifting Polewards

Living on Earth, the environmental news program of Public Radio, featured Amy Bower, Senior Scientist at Woods Hole Oceanographic Institution, on 27th March 2020 to discuss new research from the Alfred Wegener Institute showing that major ocean currents are drifting polewards.

Dr Bower explained the research in this interview (for which there is also a transcript), and its implications for climate and fisheries.

## Keep fossil fuels in the ground

### Ah, wouldn’t it be lovely!?

Is this the beginning of the Minsky Moment Mark Carney has feared? In short, that fear arose because the trading markets had not priced in (a) the risks from climate change, and (b) the risks from fossil fuels being abandoned in favor of clean energy. Dr Carney has repeatedly expressed his concern that if impacted companies did not provide this risk information to investors, they would be blind to the impending change and what it might cost them. The Minsky Moment, then, would be a sudden awakening on the part of investors, acknowledging the risk, and, in a short time, pricing its implications into the values of their holdings, causing a market crash.

## “Lockdown WORKS”

Tamino favors LASSO LOESS and piecewise linear models. I favor splines, especially penalized smoothing splines via the R package pspline, using generalized cross validation to set the smoothing parameter. Tamino looks for breaks in the piecewise linear case to check for and test for significant changes. I use the first and higher derivatives of the spline.

Both methods are sound and good.

I don’t know how you might use a random forest regression for this purpose, but I bet there is a way. I doubt it is as good, though.

Over 2400 Americans died yesterday from Coronavirus. Here are the new deaths per day (“daily mortality”) in the USA since March 10, 2020 (note: this is an exponential plot)

View original post 241 more words

## Cloud brightening hits a salty snag

The proposal known as solar radiation management is complicated. It just got more so. Released Wednesday:

### Fossum, K.N., Ovadnevaite, J., Ceburnis, D. et al. “Sea-spray regulates sulfate cloud droplet activation over oceans“, Climate and Atmospheric Science, 3(14): (2020). https://doi.org/10.1038/s41612-020-0116-2

##### [open access]

The above is an experimental essay regarding the effects of salt spray upon an artificial proposed technique to enhance the Twomey Effect.

Solar geoengineering attempts to mitigate some of the effects of climate disruption by fossil fuel emissions using various technologies to increase Earth’s albedo. While there are water droplet-based techniques, many involve injecting Sulphur-derived droplets into the atmosphere at differing heights, because these droplets are bright. Were these to succeed, increasing albedo means that less solar radiation would reach Earth’s surface, thus cooling it. (See also.)

Assuming this remedy worked, meaning it controls increases in surface and oceanic temperatures, this will not solve ocean acidification, because carbonic acid and related concentrations in oceans will continue to increase, as long as emissions from fossil fuels and agriculture continue. Moreover, there are attendant risks: These particles fall out of the atmosphere so they need to be continually replenished. There are concerns, as noted in a talk cited here by the late Professor Wally Broecker, that there may not be sufficient sulfur supply, or that this may pinch the market for sulfur, impacting prices of other products which demand it. If the replenishment were interrupted, say, by a war or a pandemic, there are concerns regarding the impulse response of a climate system where the blocking of radiation is suddenly released (“termination shock”).

Professor David Keith of Harvard University is a proponent of these measures. There are many, including Professor Ray Pierrehumbert, who are highly skeptical and suspicious (“albedo hacking”, originally due to Kintisch from 2010).

And there are also concerns this might not work, at least not well. And, the reasoning goes, if large scale impacts to humanity on the planet are to be safeguarded by betting on such a technique, especially if there’s a moral hazard that the true solution, stopping emissions from fossil fuel burning, will be hindered by the technology’s possibilities, it had better be known to work, and work well.

One measure, cloud brightening, was dealt a blow by the paper cited at the top. It should be noted that

#### Horowitz, H. M., Holmes, C., Wright, A., Sherwen, T., Wang, X., Evans, M., et al. (2020). “Effects of sea salt aerosol emissions for marine cloud brightening on atmospheric chemistry: Implications for radiative forcing“, Geophysical Research Letters, 47, e2019GL085838. https://doi.org/10.1029/2019GL085838

##### (Also open access)

also reported on interactions of sea salt with aerosols in the marine cloud brightening case.

The net, from the Abstract by Fossum, et al:

### We present new experimental results from the remote Southern Ocean illustrating that, for a given updraft, the peak supersaturation reached in cloud, and consequently the number of droplets activated on sulfate nuclei, is strongly but inversely proportional to the concentration of sea-salt activated despite a 10-fold lower abundance. Greater sea-spray nuclei availability mostly suppresses sulfate aerosol activation leading to an overall decrease in cloud droplet concentrations; however, for high vertical updrafts and low sulfate aerosol availability, increased sea-spray can augment cloud droplet concentrations. This newly identified effect where sea-salt nuclei indirectly controls sulfate nuclei activation into cloud droplets could potentially lead to changes in the albedo of marine boundary layer clouds by as much as 30%.

The atmosphere is a complicated beastie.

## Machiavelli

### It’s right out of Machiavelli’s The Prince.

#covid_19 #coronavirus

Even for the Trump administration, it is odd they are pushing #Hydroxychloroquine and #Azithromycin so hard, against medical advice and evidence.

I’ve thought about this and, given the growing animosity between Trump and Navarro and the physicians, I think I know what’s going on. By vigorously advancing #Hydroxychloroquine and #Azithromycin, Trump, Giuliani, and Navarro are setting up the medical community and the HHS/CDC/FDA to be fall guys in case the #covid_19 virus produces far more fatalities than the administration’s projections. Note, by the way, that no reputable forecaster backs those projections up and the administration is refusing to reveal how they came up with them. For if they push these drugs whose efficacy is found to be zero or worse, they know the medical community will oppose their wholesale administration, and if the pandemic causes many more American deaths, they can claim that, well, if only the medical community didn’t oppose them so much maybe it wouldn’t have been so bad. In other words, it’s setting up so they don’t take the blame for the deaths, or at least can sow doubt among their core supporters.

### Update, 2020-04-07

Government officials have reportedly been told not to contradict Trump in public regarding his support for hydroxychloroquine and azithromycin. The original report apparently came from Politico. In addition, Trump is directing some of his government medical staff to work on getting these drugs into the population, thereby diminishing the number of them actually contending with the pandemic.

Without surprise, the American medical community, as demonstrated by an advisory editorial in the Annals of Internal Medicine, continues to argue that using these drugs as prophylactic or treatment against COVID-19 is inadvisable and harmful.

There are also now concerns about method and data analysis regarding the paper which is most cited in support of using these drugs, including some questions about how p-values were calculated (they did not use Yates’s correction for the contingency tables). A critical scholar did a reanalysis using better methods and discovered, among other things, that a mixed-effects logistic regression fit poorly but suggested that improvement in outcomes results primarily because of the passage of time.
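For readers who haven’t met it, Yates’s continuity correction shrinks the chi-square statistic for a 2×2 contingency table, which raises the p-value. The sketch below shows the effect on a made-up table. The counts are hypothetical and are not taken from the paper in question, and the function names are my own.

```python
import math


def chi_square_2x2(a, b, c, d, yates=False):
    """Chi-square statistic for the 2x2 table [[a, b], [c, d]].

    With yates=True, |ad - bc| is reduced by n/2 (floored at zero)
    before squaring, which is Yates's continuity correction.
    """
    n = a + b + c + d
    diff = abs(a * d - b * c)
    if yates:
        diff = max(0.0, diff - n / 2.0)
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    return n * diff * diff / denom


def p_value_1df(chi2):
    """Upper-tail probability of a chi-square variable with 1 degree of freedom."""
    return math.erfc(math.sqrt(chi2 / 2.0))


# Hypothetical table: rows are treated/control, columns are improved/not improved.
plain = chi_square_2x2(12, 8, 5, 15)
corrected = chi_square_2x2(12, 8, 5, 15, yates=True)
print(plain, p_value_1df(plain))
print(corrected, p_value_1df(corrected))
```

In this toy case the uncorrected test falls below the conventional 0.05 threshold and the corrected one does not, which is why the choice matters for small samples.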

### Afterthought, 2020-04-07

Since hydroxychloroquine is used to treat Lupus, you would think that those afflicted with Lupus would have a statistically lower chance of getting COVID-19 were hydroxychloroquine a prophylactic for it. That threshold would need to be adjusted because Lupus itself is an autoimmune disease which weakens immune systems and makes its victims more susceptible to infection. However, there is no evidence that these patients are protected and, in fact, Lupus patients need to be even more careful about handwashing and distancing.

## Oldie and Goodie: ‘Testing a point Null Hypothesis: The irreconcilability of p-values and evidence’

A blog post by Professor Christian Robert mentioned a paper by Professors James Berger and Tom Sellke, which I downloaded several years back but never got around to reading.

### J. O. Berger, T. M. Sellke, “Testing a point Null Hypothesis: The irreconcilability of p values and evidence”, Journal of the American Statistical Association, March 1987, 82(397), Theory and Methods, 112-122.

I even overlooked the paper when I lectured about the statement on statistical significance and p-values by the American Statistical Association at my former employer. But it’s great. Abstract is below.

## “Social Distancing Works”

Nice work by Tamino, including showing when a log plot is appropriate and when it is not. His post, reblogged:

The death toll from Coronavirus in the U.S.A. stands at 4,059, and more alarming is the fact that yesterday brought nearly a thousand deaths in a single day. The numbers keep rising.

America has confirmed 188,639 cases (many more unconfirmed), more than any other country in the world (although Italy leads in fatalities with 12,428). The total number of cases in the U.S. shows a very unfortunate and frankly, scary trend: exponential growth.


## New COVID-19 incidence in the United States as AR(1) processes

There are several sources of information regarding Covid-19 incidence now available. This post uses data from a single source: the COVID Tracking Project. In particular I restrict attention to cumulative daily case counts for the United States, the UK, and the states of New York, Massachusetts, and Connecticut. The United States data are available here. The data for the United Kingdom are available here. I’m only considering reports through the 26th of March.

Please note that little in these models can be used to properly inform projections, and nothing here should be interpreted as medical advice for your personal care. Consult your physician for that. This is a scholarly investigation.

To begin, the table of cumulative counts of positive tests for coronavirus in the United States is:

| date | positive | positive as fraction of positive and negative tests |
|---|---|---|
| 20200326 | 80735 | 0.1555 |
| 20200325 | 63928 | 0.1517 |
| 20200324 | 51954 | 0.1507 |
| 20200323 | 42152 | 0.1508 |
| 20200322 | 31879 | 0.1415 |
| 20200321 | 23197 | 0.1295 |
| 20200320 | 17033 | 0.1260 |
| 20200319 | 11719 | 0.1162 |
| 20200318 | 7730 | 0.1045 |
| 20200317 | 5723 | 0.1073 |
| 20200316 | 4019 | 0.1002 |
| 20200315 | 3173 | 0.1233 |
| 20200314 | 2450 | 0.1253 |
| 20200313 | 1922 | 0.1237 |
| 20200312 | 1315 | 0.1406 |
| 20200311 | 1053 | 0.1478 |
| 20200310 | 778 | 0.1697 |
| 20200309 | 584 | 0.1478 |
| 20200308 | 417 | 0.1515 |
| 20200307 | 341 | 0.1586 |
| 20200306 | 223 | 0.1243 |
| 20200305 | 176 | 0.1559 |
| 20200304 | 118 | 0.1363 |

Now, the increase in positive tests is driven by numbers of infections, but it is also heavily influenced by amount of testing. The same source, however, offers the cumulative number of negative tests and, so, I have expressed the positive test count as a fraction of the number of positive tests plus the number of negative tests in the rightmost column.

If the increase in the number of positive tests is related to the increase in the prevalence of the virus, and since the infection diffuses through the population, then the increase ought to be related to the number of positive tests. As noted, the increase in the number of positive tests could also be related to the administration of additional tests, so, with only positive tests data, these are confounded. However, since the cumulative number of tests administered is available, we can see how strongly the increase in the number of tests determines the number of positives, rather than an expansion in the number of cases.

Letting $y_{t}$ denote the cumulative count of number of positive cases on day $t$, and $x_{t}$ the cumulative count of number of positive and number of negative tests, I’m interested in

$\Delta y_{t} \sim a_{y} y_{t-1} + a_{x} x_{t-1} + \eta$

the difference relationship. In another equivalent expression,

$y_{t} - y_{t-1} = a_{y} y_{t-1} + a_{x} x_{t-1} + \eta$

where $\eta$ denotes integral count noise.

This amounts to a linear regression on two covariates, with the resulting $a_{y}$ and $a_{x}$ indicating how strongly the increase in positive test counts is determined by the corresponding covariate. The $a_{y} y_{t-1}$ term is the AR(1) part of the model. Using R’s lm function results in:

    > fit.usa.pn <- lm(D.usa ~ Q.usa[2:23] + PN.usa[2:23])
    > summary(fit.usa.pn)

    Call:
    lm(formula = D.usa ~ Q.usa[2:23] + PN.usa[2:23])

    Residuals:
            Min          1Q      Median          3Q         Max
    -1411.26552  -491.51999  -172.32025   341.17292  1652.94005

    Coefficients:
                       Estimate     Std. Error  t value  Pr(>|t|)
    (Intercept)  717.8679340137 301.0981976893  2.38417  0.027702 *
    Q.usa[2:23]    0.1981967029   0.0087058298 22.76597 2.986e-15 ***
    PN.usa[2:23]  -0.0025874532   0.0016414650 -1.57631  0.131460
    ---
    Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

    Residual standard error: 821.74248 on 19 degrees of freedom
    Multiple R-squared: 0.97426944, Adjusted R-squared: 0.97156096
    F-statistic: 359.71083 on 2 and 19 DF, p-value: 7.9298893e-16

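Fitting a model without an intercept does not materially improve matters (note that R computes $R^{2}$ differently for a no-intercept model, so the two values are not directly comparable):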

    > fit.usa.noint <- lm(D.usa ~ Q.usa[1:22] + PN.usa[1:22] + 0)
    > summary(fit.usa.noint)

    Call:
    lm(formula = D.usa ~ Q.usa[1:22] + PN.usa[1:22] + 0)

    Residuals:
             Min           1Q       Median           3Q          Max
    -1950.894882   -61.091628     8.041637   584.900171  2464.061157

    Coefficients:
                      Estimate    Std. Error  t value   Pr(>|t|)
    Q.usa[1:22]  0.26801699520 0.01121341634 23.90146 3.5088e-16 ***
    PN.usa[1:22] 0.00018947235 0.00132300092  0.14321    0.88755
    ---
    Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

    Residual standard error: 1158.4248 on 20 degrees of freedom
    Multiple R-squared: 0.96619952, Adjusted R-squared: 0.96281947
    F-statistic: 285.8538 on 2 and 20 DF, p-value: 1.946384e-15
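The construction of `D.usa`, `Q.usa`, and `PN.usa` is not shown in the console output. The following is a minimal sketch of one plausible construction, assuming `Q.usa` is the cumulative positive count in chronological order, `D.usa` is its first difference, and `PN.usa` is the cumulative number of positive plus negative tests. Because only the rounded fraction is tabulated above, `PN.usa` is reconstructed here as positive count divided by fraction, so it is approximate and the fitted coefficients need not match the output digit for digit.

```r
# Cumulative positive tests, United States, 4 to 26 March 2020,
# copied from the table above and reversed into chronological order.
Q.usa <- c(118, 176, 223, 341, 417, 584, 778, 1053, 1315, 1922, 2450,
           3173, 4019, 5723, 7730, 11719, 17033, 23197, 31879, 42152,
           51954, 63928, 80735)

# Tabulated fraction positive / (positive + negative), same order.
frac.usa <- c(0.1363, 0.1559, 0.1243, 0.1586, 0.1515, 0.1478, 0.1697,
              0.1478, 0.1406, 0.1237, 0.1253, 0.1233, 0.1002, 0.1073,
              0.1045, 0.1162, 0.1260, 0.1295, 0.1415, 0.1508, 0.1507,
              0.1517, 0.1555)

# ASSUMPTION: total tests recovered from the rounded fraction (approximate).
PN.usa <- Q.usa / frac.usa

# Daily increase in positives: y_t - y_{t-1}, 22 values.
D.usa <- diff(Q.usa)

# No-intercept regression of the increase on the previous day's counts.
fit.sketch <- lm(D.usa ~ Q.usa[1:22] + PN.usa[1:22] + 0)

# Same regression using only the AR(1) covariate.
fit.q.only <- lm(D.usa ~ Q.usa[1:22] + 0)

print(coef(fit.sketch))
print(coef(fit.q.only))
```

Indexing the covariates with `[1:22]` pairs each day's increase with the previous day's cumulative counts, which is what the $y_{t-1}$ and $x_{t-1}$ terms require.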

Note that the dependence upon the total number of tests is weak. Plotting the increases in positive tests against the cumulative number of positive tests the previous day, with the intercept-free line superimposed, gives:


Below is the same analysis applied to New York State:

    > fit.ny.noint <- lm(D.ny ~ Q.ny[1:22] + PN.ny[1:22] + 0)
    > summary(fit.ny.noint)

    Call:
    lm(formula = D.ny ~ Q.ny[1:22] + PN.ny[1:22] + 0)

    Residuals:
             Min           1Q       Median           3Q          Max
    -1208.249960   -17.977446     3.489550   450.488558  2247.967816

    Coefficients:
                     Estimate    Std. Error  t value   Pr(>|t|)
    Q.ny[1:22]  0.24758294065 0.01993668614 12.41846 7.4014e-11 ***
    PN.ny[1:22] 0.00027030184 0.00452203747  0.05977    0.95293
    ---
    Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

    Residual standard error: 988.67859 on 20 degrees of freedom
    Multiple R-squared: 0.88521556, Adjusted R-squared: 0.87373712
    F-statistic: 77.119823 on 2 and 20 DF, p-value: 3.9703623e-10

Below is the same analysis applied to Massachusetts:

    > fit.ma.noint <- lm(D.ma ~ Q.ma[1:22] + PN.ma[1:22] + 0)
    > summary(fit.ma.noint)

    Call:
    lm(formula = D.ma ~ Q.ma[1:22] + PN.ma[1:22] + 0)

    Residuals:
            Min          1Q      Median          3Q         Max
    -24.4349493  -7.8075546  -0.8168569   8.3116535  42.0791554

    Coefficients:
                     Estimate    Std. Error t value  Pr(>|t|)
    Q.ma[1:22]  0.17585256284 0.04856188611 3.62121 0.0035068 **
    PN.ma[1:22] 0.00032421632 0.00052756343 0.61455 0.5503242
    ---
    Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

    Residual standard error: 18.978648 on 12 degrees of freedom
      (8 observations deleted due to missingness)
    Multiple R-squared: 0.54502431, Adjusted R-squared: 0.46919503
    F-statistic: 7.1875178 on 2 and 12 DF, p-value: 0.0088701128

Below is the same analysis applied to Connecticut:

    > fit.ct.noint <- lm(D.ct ~ Q.ct[1:19] + PN.ct[1:19] + 0)
    > summary(fit.ct.noint)

    Call:
    lm(formula = D.ct ~ Q.ct[1:19] + PN.ct[1:19] + 0)

    Residuals:
             Min           1Q       Median           3Q          Max
    -117.0797164   -1.3840086    1.2688491   11.6339214  127.2275604

    Coefficients:
                     Estimate    Std. Error t value   Pr(>|t|)
    Q.ct[1:19]  0.29036730663 0.04573094922 6.34947 7.2613e-06 ***
    PN.ct[1:19] 0.00027743517 0.00461747631 0.06008    0.95279
    ---
    Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

    Residual standard error: 55.280014 on 17 degrees of freedom
    Multiple R-squared: 0.70352934, Adjusted R-squared: 0.66865044
    F-statistic: 20.170628 on 2 and 17 DF, p-value: 3.2497097e-05

And, finally, below is the same analysis applied to the United Kingdom:

    > fit.uk.noint <- lm(D.uk ~ Q.uk[1:29] + 0)
    > summary(fit.uk.noint)

    Call:
    lm(formula = D.uk ~ Q.uk[1:29] + 0)

    Residuals:
            Min          1Q      Median          3Q         Max
    -423.525900   -8.490398    5.891371   37.652952  352.355823

    Coefficients:
                   Estimate   Std. Error  t value   Pr(>|t|)
    Q.uk[1:29] 0.2174876924 0.0074761256 29.09096 < 2.22e-16 ***
    ---
    Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

    Residual standard error: 156.16753 on 28 degrees of freedom
    Multiple R-squared: 0.9679738, Adjusted R-squared: 0.96683
    F-statistic: 846.28412 on 1 and 28 DF, p-value: < 2.22045e-16

Recall that the UK data come from a different source, which does not provide the total number of tests performed. Consequently, I did not check whether the dependence of increases in positive tests upon the total number of tests was weak in their case. Also, the UK’s testing procedure and its biochemistry are probably different from those in the United States, although there is no assurance that tests performed from state to state are strictly exchangeable.

The plot for the UK is:

Summarizing the no intercept results:

| Country/State | $a_{y}$ | standard error in $a_{y}$ | adjusted $R^{2}$ |
|---|---|---|---|
| United States | 0.268 | 0.011 | 0.963 |
| New York | 0.247 | 0.020 | 0.874 |
| Massachusetts | 0.176 | 0.048 | 0.469 |
| Connecticut | 0.290 | 0.045 | 0.669 |
| United Kingdom | 0.217 | 0.007 | 0.967 |

As noted above the United Kingdom results are not strictly comparable for the reasons given. The conclusion is that the AR(1) term dominates and at least for country levels appears predictive. The low $R^{2}$ values for Massachusetts and Connecticut may be because of low numbers of counts, or relative youth of the epidemic there. Thus, my interpretation is that increase in positive case count is driven by the disease, not because testing has accelerated. This disagrees with an implication I made in an earlier blog post.
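One way to read the $a_{y}$ values is as a daily growth rate. As a rough sketch, assuming the testing term and the noise are dropped so that $y_{t} \approx (1 + a_{y}) y_{t-1}$, the implied doubling time in days is $\log 2 / \log(1 + a_{y})$. This is my own back-of-envelope reading of the fitted coefficients, not a forecast:

```r
# Fitted a_y values from the no-intercept summary table.
a_y <- c("United States"  = 0.268,
         "New York"       = 0.247,
         "Massachusetts"  = 0.176,
         "Connecticut"    = 0.290,
         "United Kingdom" = 0.217)

# ASSUMPTION: pure geometric growth y_t = (1 + a_y) * y_{t-1},
# ignoring the testing covariate and the noise term.
doubling_days <- log(2) / log1p(a_y)

print(round(doubling_days, 1))
```

Under that simplification every series in the table doubles in roughly three to four days.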

This work was inspired in part by the article,

D. Benvenuto, M. Giovanetti, L. Vassallo, S. Angeletti, M. Ciccozzi, “Application of the ARIMA model on the COVID-2019 epidemic dataset”, Data in Brief, 29 (2020), 105340.

#### Update, 2020-03-29, 00:24 EDT

Other recent work with R regarding the COVID-19 pandemic:

## What happens when time sampling density of a series matches its growth

This is the newly updated map of COVID-19 cases in the United States, updated, presumably, because of the new emphasis upon testing:

How do we know this is the result of recent testing? Look at the map of active cases:

To the degree that the numbers of active cases coincide with the cumulative cases, these are recent detections.

In other words, while importing COVID-19 cases from Europe is of some concern, the virus is here, the disease is here, and a typical person in the United States is much more likely to contract the disease from a fellow infected American who has not travelled than from a European person coming here (note there are no prohibitions against Americans coming home).

The lesson is that if a process has a certain rate of growth, and the sampling density in time isn’t keeping up with that growth, it is inevitable there will be extreme underreporting.

I would like to understand if that suppression of testing was deliberate or not. Between the present administration’s classifications of COVID-19-related information under National Security acts and the documented suppression of information on federal Web sites relating to climate change, I would be highly suspicious that such suppression, which would put 45 in an unpopular light, was accidental.

##### (update, 2020-03-15, 2113 EDT)

In the above, note that once a sampling density in time is increased to match growth of the counts, then it will appear as if the rate of growth of cases is extraordinary. That is false, of course, but it is a consequence of the failure to have an adequate sampling density (in time) in the first case.
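The effect can be illustrated with a small simulation. This is only a sketch with made-up numbers: true cumulative cases are assumed to grow 25% per day, and detection is assumed to be capped at 50 new cases per day for the first 20 days, after which the cap is lifted.

```r
days <- 30

# ASSUMPTION: hypothetical true cumulative cases, growing 25% per day.
true_cases <- 100 * 1.25^(0:(days - 1))

# ASSUMPTION: at most 50 new detections per day for 20 days, then no cap.
capacity <- c(rep(50, 20), rep(Inf, 10))

# Reported cumulative cases can never exceed the true count, and can
# rise by at most the day's testing capacity.
reported <- numeric(days)
prev <- 0
for (t in seq_len(days)) {
  reported[t] <- min(true_cases[t], prev + capacity[t])
  prev <- reported[t]
}

# Day-over-day growth factor as it would appear in the reported series.
apparent_growth <- reported[-1] / reported[-days]

print(round(cbind(day = 2:days,
                  true = true_cases[-1],
                  reported = reported[-1],
                  apparent_growth = apparent_growth), 2))
```

While the cap binds, the reported series grows linearly and falls far behind the true count. On the day the cap is lifted the reported series catches up all at once, so its apparent growth factor is several times the true 1.25, even though the underlying process never changed.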

MSRI talk:

## Primary producers

These are from NASA’s Aqua-MODIS, meaning, Aqua satellite, MODIS instrument:

##### (h/t Earth Observatory at NASA)

See my related blog post. And, note, it’s all about the phenology.

## R ecosystem package coronavirus

Dr Rami Krispin of the Johns Hopkins University Center for Systems Science and Engineering (JHU CSSE) has just released the R package coronavirus, which “provides a daily summary of the Coronavirus (COVID-19) cases by state/province”, caused by 2019-nCoV.

###### (update 2020-03-12 1337 EDT)

I have noticed that the data in the R package cited above seriously lags the data directly available at the Github site. So if you want current data, go to the latter source.

There is also an open access article in The Lancet about the capability.

### Estimating mortality rate

##### (update 2020-03-08, 15:57 EDT)

There’s a lot of good quantitative epidemiological work being done in and around the 2019-nCoV outbreak. I was drawn to this report by Professor Andrew Gelman, one of the co-authors of STAN, but the primary work was done by Dr Julien Riou and colleagues in a pre-print paper at medR$\chi$iv.

There’ll probably be a good deal more once people have some time to get into case data. I’ll keep this post updated with interesting ones I find.

##### (update 2020-03-12, 11:37 EDT)

New compilation site by Avi Schiffmann, a budding data scientist.

##### (update 2020-03-12-13:26 EDT)

The Johns Hopkins data are also accessible at their Github site. These are updated every five minutes. And they are cloned in many places.

There is an alternative source available online, with data last dated 11 March 2020 (yesterday), from many contributors (list available online). This source’s map has much finer spatial resolution.

##### (update 2020-03-13, 11:49 EDT)

The New England Journal of Medicine has a full page of links concerning the Coronavirus (SARS-CoV-2) causing Covid-19, and it is fully open access.

Science magazine this week features four SARS-CoV-2-related articles:

##### (update 2020-03-13, 12:41 EDT)


## Curiosity’s recent view of Mars

“NASA Curiosity Project Scientist Ashwin Vasavada guides this tour of the rover’s view of the Martian surface.”

With a little imagination, feels like a de-vegetated version of the Northern Coastal Ranges of California, looking inland.

# via Code for causal inference: Interested in astronomical applications

## Reanalysis of business visits from deployments of a mobile phone app

This reports a reanalysis of data from the deployment of a mobile phone app, as reported in:

M. Yauck, L.-P. Rivest, G. Rothman, “Capture-recapture methods for data on the activation of applications on mobile phones”, Journal of the American Statistical Association, 2019, 114:525, 105-114, DOI: 10.1080/01621459.2018.1469991.

The article is as linked. There is supplementary information and most datasets are freely available.

The data set analyzed in the paper was provided by Ninth Decimals, 625 Ellis St., Ste. 301, Mountain View, CA 94043, a marketing platform using location data, as indicated in the documentation of the original paper.

This work is concerned with the analysis of marketing data on the activation of applications (apps) on mobile devices. Each application has a hashed identification number that is specific to the device on which it has been installed. This number can be registered by a platform at each activation of the application. Activations on the same device are linked together using the identification number. By focusing on activations that took place at a business location, one can create a capture-recapture dataset about devices, that is, users, that “visited” the business: the units are owners of mobile devices and the capture occasions are time intervals such as days. A unit is captured when she activates an application, provided that this activation is recorded by the platform providing the data. Statistical capture-recapture techniques can be applied to the app data to estimate the total number of users that visited the business over a time period, thereby providing an indirect estimate of foot traffic. This article argues that the robust design, a method for dealing with a nested mark-recapture experiment, can be used in this context. A new algorithm for estimating the parameters of a robust design with a fairly large number of capture occasions and a simple parametric bootstrap variance estimator are proposed. Moreover, new estimation methods and new theoretical results are introduced for a wider application of the robust design. This is used to analyze a dataset about the mobile devices that visited the auto-dealerships of a major auto brand in a U.S. metropolitan area over a period of 1 year and a half. Supplementary materials for this article, including a standardized description of the materials available for reproducing the work, are available as an online supplement.

The paper applies mark-recapture methods in a digital context, estimating the market size for a particular mobile phone application week by week, and offering a new means for doing so based upon estimating equations. The application to mobile phones was to estimate

My interest was to use the freely available data to check their overall estimates, but also to use it as the keystone of a semi-mathematical tutorial introducing mark-recapture methods, and illustrating modern non-parametric varieties of mark-recapture inference. In particular, the tutorial highlights little known or at least infrequently cited work by R. Tanaka:

• R. Tanaka, “Estimation of Vole and Mouse Populations on Mt Ishizuchi and on the Uplands of Southern Shikoku”, Journal of Mammalogy, November 1951, 32(4), 450-458.
• R. Tanaka, “Safe approach to natural home ranges, as applied to the solution of edge effect subjects, using capture-recapture data in vole populations”, Proceedings of the 6th Vertebrate Pest Conference, 1974, Vertebrate Pest Conference Proceedings Collection, University of Nebraska, Lincoln.
• R. Tanaka, “Controversial problems in advanced research on estimating population densities of small rodents”, Researches on Population Ecology, 1980, 22, 1-67.

Seber cites Tanaka in:

G. A. F. Seber, The Estimation of Animal Abundance and Related Parameters, 2nd edition, 1982, Griffin, London, page 145.
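For readers new to mark-recapture, the basic idea can be sketched with the classical two-occasion Lincoln-Petersen estimator and Chapman’s bias-corrected version of it. This is neither the robust design of Yauck, Rivest, and Rothman nor Tanaka’s method, and the counts below are hypothetical, chosen only to show the arithmetic.

```r
# Hypothetical counts, for illustration only (not from the app dataset).
n1 <- 200   # units captured and "marked" on the first occasion
n2 <- 150   # units captured on the second occasion
m2 <- 30    # units on the second occasion that were already marked

# Lincoln-Petersen: the marked fraction in the second sample, m2 / n2,
# is taken as an estimate of the marked fraction of the population, n1 / N.
lincoln_petersen <- n1 * n2 / m2

# Chapman's bias-corrected form and its usual variance estimate.
chapman     <- (n1 + 1) * (n2 + 1) / (m2 + 1) - 1
chapman_var <- (n1 + 1) * (n2 + 1) * (n1 - m2) * (n2 - m2) /
               ((m2 + 1)^2 * (m2 + 2))
chapman_se  <- sqrt(chapman_var)

print(c(lincoln_petersen = lincoln_petersen,
        chapman          = chapman,
        chapman_se       = chapman_se))
```

Both estimators assume a closed population and equal capture probability for every unit, which are exactly the assumptions the more elaborate methods discussed here relax.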

The number of first encounters of phone apps, of which there are a total of 9316 distinct ones, looks like:

Only 1654 are observed two or more times during the 77 week experiment. For the total population seen, the minimum number seen per week was 50 and the maximum was 509. The profile of the number of first observed apps, that first observation constituting a “marking”, looks like:

The technique by Tanaka reprised in the companion paper and extended there to populations with dramatically varying size and, separately, to populations with dramatically varying capture probabilities, determined there is no basis for appreciable variation in the population size or capture probability during the experiment. Accordingly, its estimate of the total population is done as follows:

Because Yauck, Rivest, and Rothman were principally interested in the marketing-oriented determination of visits per week, they do not offer an estimate of the overall population. So their results are difficult to compare with the result from the Tanaka extension. However, assuming the degree of overlap among visits to the business per week is smaller for a smaller interval, and is also smaller because, by all accounts, the number of deployed apps is small, the authors offer estimates for the first 20 weeks. This subset was also estimated by the Tanaka method, and produced:

A tally of the 20 week estimates from Yauck, Rivest, and Rothman gives a total of 1191 for the closed population model, a total of 1202 for the “robust” model, and 1132 for the Jolly-Seber model. These compare with the Tanaka method, which gives a population estimate of 2141, with a low estimate of 1896 and a high estimate of 2459.

Reasons for discrepancies may vary. For one, Yauck, Rivest, and Rothman are not trying to estimate overall population, or at least they do not report these. Their overall profile of population per week, taken from their paper, is shown below:

But they also do report a seasonal variation in capture probability, as shown in the following chart from their work:

Such a variation could explain a shortfall. It remains a discrepancy, though, why the segmentation of the Tanaka fit doesn’t acknowledge such variations in capture probability.

The implementation of the Tanaka extension is done using R code, with related files being available online. The segmentation is overseen using the facilities of the segmented package for R developed by Professor V. M. R. Muggeo.

If you think you have one or more problems which might benefit from this kind of insight and technique, be sure to reach out. You can find my contact information under the “About” tab at the top of the blog. Or just leave a comment to this post below. All comments are moderated and kept private until I review them, so if you’d prefer to keep the reach-out non-public, just say so, and I will.

And stay tuned for other blog posts. After mark-recapture, in the beginning of May, I’ll be showing how causal inference and techniques like propensity scoring can be used in scientific research as well as in policy and marketing assessments.