New COVID-19 incidence in the United States as AR(1) processes

Several sources of information on COVID-19 incidence are now available. For the United States, this post uses data from a single source, the COVID Tracking Project. In particular, I restrict attention to cumulative daily case counts for the United States, the United Kingdom, and the states of New York, Massachusetts, and Connecticut. The United States data are available here. The data for the United Kingdom are available here. I’m only considering reports through the 26th of March.

Please note that little in these models can be used to properly inform projections, and nothing here should be interpreted as medical advice for your personal care. Consult your physician for that. This is a scholarly investigation.

To begin, the table of cumulative counts of positive tests for coronavirus in the United States is:

date     positive     positive as fraction of positive and negative tests
20200326 80735 0.1555
20200325 63928 0.1517
20200324 51954 0.1507
20200323 42152 0.1508
20200322 31879 0.1415
20200321 23197 0.1295
20200320 17033 0.1260
20200319 11719 0.1162
20200318 7730 0.1045
20200317 5723 0.1073
20200316 4019 0.1002
20200315 3173 0.1233
20200314 2450 0.1253
20200313 1922 0.1237
20200312 1315 0.1406
20200311 1053 0.1478
20200310 778 0.1697
20200309 584 0.1478
20200308 417 0.1515
20200307 341 0.1586
20200306 223 0.1243
20200305 176 0.1559
20200304 118 0.1363

Now, the increase in positive tests is driven by numbers of infections, but it is also heavily influenced by amount of testing. The same source, however, offers the cumulative number of negative tests and, so, I have expressed the positive test count as a fraction of the number of positive tests plus the number of negative tests in the rightmost column.
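As a small illustration of the rightmost column, the sketch below backs out the total number of tests implied by the table, using total = positive / fraction. Only the three most recent rows are used, the variable names are illustrative, and because the published fractions are rounded to four decimals the implied totals are approximate.

```r
# Three most recent rows of the table above: cumulative positives and
# positives as a fraction of positive plus negative tests.
positive <- c(80735, 63928, 51954)
fraction <- c(0.1555, 0.1517, 0.1507)

# fraction = positive / (positive + negative), so the implied total is
# positive / fraction. The fractions are rounded, so this is approximate.
implied_total    <- positive / fraction
implied_negative <- implied_total - positive

print(round(implied_total))
print(round(implied_negative))
```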

If the increase in the number of positive tests reflects an increase in the prevalence of the virus, then, since the infection diffuses through the population, the increase ought to be related to the number of positive tests already recorded. As noted, the increase could also result from the administration of additional tests, so, with only positive test data, the two effects are confounded. However, since the cumulative number of tests administered is available, we can see how strongly the increase in positives is determined by the number of tests rather than by an expansion in the number of cases.

Letting y_{t} denote the cumulative count of positive tests on day t, and x_{t} the cumulative count of positive plus negative tests, I’m interested in

\delta y_{t} \sim a_{y} y_{t-1} + a_{x} x_{t-1} + \eta

the difference relationship. Equivalently,

y_{t} - y_{t-1} = a_{y} y_{t-1} + a_{x} x_{t-1} + \eta

where \eta denotes integer-valued count noise.
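A minimal sketch of how the difference relationship can be set up in R, using only the cumulative positive counts from the table above. The names y, D and Q are illustrative and are not necessarily how the variables behind the fits in this post were constructed. The total-tests covariate x_{t} is left out because the table lists only the fraction, not the raw totals.

```r
# Cumulative positive tests for the United States, oldest first
# (2020-03-04 through 2020-03-26), copied from the table above.
y <- c(118, 176, 223, 341, 417, 584, 778, 1053, 1315, 1922, 2450, 3173,
       4019, 5723, 7730, 11719, 17033, 23197, 31879, 42152, 51954,
       63928, 80735)

D <- diff(y)          # y_t - y_{t-1}: 22 daily increases
Q <- y[-length(y)]    # y_{t-1}: the previous day's cumulative count

# Regress the increase on the lagged cumulative count, with no intercept
# and without the x_{t-1} covariate.
fit <- lm(D ~ Q + 0)
a_y <- unname(coef(fit)["Q"])
print(a_y)
```

Since diff() drops the first observation and Q drops the last, the two vectors line up so that each increase is paired with the previous day's cumulative count.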

This amounts to a linear regression on two covariates, with the fitted a_{y} and a_{x} indicating how strongly the increase in positive test counts is determined by the corresponding covariate. The term a_{y} y_{t-1} is the AR(1) term. Using R’s lm function results in:


> fit.usa.pn <- lm(D.usa ~ Q.usa[2:23] + PN.usa[2:23])
> summary(fit.usa.pn)

Call:
lm(formula = D.usa ~ Q.usa[2:23] + PN.usa[2:23])

Residuals:
Min 1Q Median 3Q Max
-1411.26552 -491.51999 -172.32025 341.17292 1652.94005

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 717.8679340137 301.0981976893 2.38417 0.027702 *
Q.usa[2:23] 0.1981967029 0.0087058298 22.76597 2.986e-15 ***
PN.usa[2:23] -0.0025874532 0.0016414650 -1.57631 0.131460
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 821.74248 on 19 degrees of freedom
Multiple R-squared: 0.97426944, Adjusted R-squared: 0.97156096
F-statistic: 359.71083 on 2 and 19 DF, p-value: 7.9298893e-16

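Fitting the model without an intercept does not improve matters: the residual standard error rises from about 822 to about 1158, and the coefficient on the total number of tests remains statistically insignificant: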

> fit.usa.noint <- lm(D.usa ~ Q.usa[1:22] + PN.usa[1:22] + 0)
> summary(fit.usa.noint)

Call:
lm(formula = D.usa ~ Q.usa[1:22] + PN.usa[1:22] + 0)

Residuals:
Min 1Q Median 3Q Max
-1950.894882 -61.091628 8.041637 584.900171 2464.061157

Coefficients:
Estimate Std. Error t value Pr(>|t|)
Q.usa[1:22] 0.26801699520 0.01121341634 23.90146 3.5088e-16 ***
PN.usa[1:22] 0.00018947235 0.00132300092 0.14321 0.88755
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1158.4248 on 20 degrees of freedom
Multiple R-squared: 0.96619952, Adjusted R-squared: 0.96281947
F-statistic: 285.8538 on 2 and 20 DF, p-value: 1.946384e-15
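To put the coefficient on PN.usa[1:22] in the output above in perspective, the sketch below forms a conventional 95% confidence interval from the printed estimate, standard error, and residual degrees of freedom. It only restates the summary output under the usual linear-model assumptions, which count data like these may not satisfy.

```r
# Values copied from the intercept-free summary output above.
est <- 0.00018947235   # estimate for PN.usa[1:22]
se  <- 0.00132300092   # its standard error
df  <- 20              # residual degrees of freedom

t_value <- est / se                          # reproduces the printed t value
ci <- est + c(-1, 1) * qt(0.975, df) * se    # two-sided 95% interval

print(t_value)
print(ci)
```

The interval straddles zero, consistent with the large p-value in the output.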

Note that the dependence upon the total number of tests is weak. The figure for the United States plots the increases in positive tests against the cumulative number of positive tests on the previous day, with the intercept-free line superimposed.
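A plot of this kind could be produced along the following lines. This is a sketch using base R graphics and the United States counts from the table at the top of the post, with only the lagged positive count as a covariate. It is not the code that generated the original figure, and the variable names are illustrative.

```r
# Cumulative positive tests for the United States, oldest first.
y <- c(118, 176, 223, 341, 417, 584, 778, 1053, 1315, 1922, 2450, 3173,
       4019, 5723, 7730, 11719, 17033, 23197, 31879, 42152, 51954,
       63928, 80735)

D <- diff(y)          # daily increase in positive tests
Q <- y[-length(y)]    # previous day's cumulative positive tests

# Slope of the intercept-free regression of D on Q.
slope <- unname(coef(lm(D ~ Q + 0)))

plot(Q, D,
     xlab = "cumulative positive tests, previous day",
     ylab = "increase in positive tests",
     main = "United States")
abline(a = 0, b = slope, lty = 2)   # intercept-free fitted line
```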

Below is the same analysis applied to New York State:


> fit.ny.noint <- lm(D.ny ~ Q.ny[1:22] + PN.ny[1:22] + 0)
> summary(fit.ny.noint)

Call:
lm(formula = D.ny ~ Q.ny[1:22] + PN.ny[1:22] + 0)

Residuals:
Min 1Q Median 3Q Max
-1208.249960 -17.977446 3.489550 450.488558 2247.967816

Coefficients:
Estimate Std. Error t value Pr(>|t|)
Q.ny[1:22] 0.24758294065 0.01993668614 12.41846 7.4014e-11 ***
PN.ny[1:22] 0.00027030184 0.00452203747 0.05977 0.95293
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 988.67859 on 20 degrees of freedom
Multiple R-squared: 0.88521556, Adjusted R-squared: 0.87373712
F-statistic: 77.119823 on 2 and 20 DF, p-value: 3.9703623e-10

Below is the same analysis applied to Massachusetts:


> fit.ma.noint <- lm(D.ma ~ Q.ma[1:22] + PN.ma[1:22] + 0)
> summary(fit.ma.noint)

Call:
lm(formula = D.ma ~ Q.ma[1:22] + PN.ma[1:22] + 0)

Residuals:
Min 1Q Median 3Q Max
-24.4349493 -7.8075546 -0.8168569 8.3116535 42.0791554

Coefficients:
Estimate Std. Error t value Pr(>|t|)
Q.ma[1:22] 0.17585256284 0.04856188611 3.62121 0.0035068 **
PN.ma[1:22] 0.00032421632 0.00052756343 0.61455 0.5503242
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 18.978648 on 12 degrees of freedom
(8 observations deleted due to missingness)
Multiple R-squared: 0.54502431, Adjusted R-squared: 0.46919503
F-statistic: 7.1875178 on 2 and 12 DF, p-value: 0.0088701128

Below is the same analysis applied to Connecticut:


> fit.ct.noint <- lm(D.ct ~ Q.ct[1:19] + PN.ct[1:19] + 0)
> summary(fit.ct.noint)

Call:
lm(formula = D.ct ~ Q.ct[1:19] + PN.ct[1:19] + 0)

Residuals:
Min 1Q Median 3Q Max
-117.0797164 -1.3840086 1.2688491 11.6339214 127.2275604

Coefficients:
Estimate Std. Error t value Pr(>|t|)
Q.ct[1:19] 0.29036730663 0.04573094922 6.34947 7.2613e-06 ***
PN.ct[1:19] 0.00027743517 0.00461747631 0.06008 0.95279
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 55.280014 on 17 degrees of freedom
Multiple R-squared: 0.70352934, Adjusted R-squared: 0.66865044
F-statistic: 20.170628 on 2 and 17 DF, p-value: 3.2497097e-05

And, finally, below is the same analysis applied to the United Kingdom:

> fit.uk.noint <- lm(D.uk ~ Q.uk[1:29] + 0)
> summary(fit.uk.noint)

Call:
lm(formula = D.uk ~ Q.uk[1:29] + 0)

Residuals:
Min 1Q Median 3Q Max
-423.525900 -8.490398 5.891371 37.652952 352.355823

Coefficients:
Estimate Std. Error t value Pr(>|t|)
Q.uk[1:29] 0.2174876924 0.0074761256 29.09096 < 2.22e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 156.16753 on 28 degrees of freedom
Multiple R-squared: 0.9679738, Adjusted R-squared: 0.96683
F-statistic: 846.28412 on 1 and 28 DF, p-value: < 2.22045e-16

Recall that the UK data come from a different source, which does not report the total number of tests performed. Consequently, I did not check whether the dependence of increases in positive tests on the number of tests is weak in their case. Also, the UK’s testing procedure and its biochemistry probably differ from those in the United States, although there is no assurance that tests performed from state to state are strictly exchangeable either.

The plot for the UK is:

Summarizing the no intercept results:

Country/State     a_{y}    standard error in a_{y}    adjusted R^{2}
United States     0.268    0.011                      0.963
New York          0.248    0.020                      0.874
Massachusetts     0.176    0.049                      0.469
Connecticut       0.290    0.046                      0.669
United Kingdom    0.217    0.007                      0.967

As noted above, the United Kingdom results are not strictly comparable, for the reasons given. The conclusion is that the AR(1) term dominates and, at least at the country level, appears predictive. The low R^{2} values for Massachusetts and Connecticut may be due to low counts or to the relative youth of the epidemic there. Thus, my interpretation is that the increase in positive case counts is driven by the disease, not by accelerated testing. This disagrees with an implication I made in an earlier blog post.
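One way to read the a_{y} estimates, offered as an aside rather than as a result of the analysis above: ignoring the noise term and the weak tests term, the difference relationship gives y_{t} = (1 + a_{y}) y_{t-1}, so cumulative counts grow by a factor of 1 + a_{y} per day and would double in log(2)/log(1 + a_{y}) days. The sketch below applies this to the intercept-free estimates printed in the regression outputs. The doubling-time framing is an assumption layered on the fitted model, the variable names are illustrative, and the same caution about projections applies.

```r
# Intercept-free a_y estimates, copied from the summary outputs above.
a_y <- c("United States"  = 0.26801699520,
         "New York"       = 0.24758294065,
         "Massachusetts"  = 0.17585256284,
         "Connecticut"    = 0.29036730663,
         "United Kingdom" = 0.2174876924)

growth_factor <- 1 + a_y                      # day-over-day multiplier
doubling_days <- log(2) / log(growth_factor)  # days for counts to double

print(round(doubling_days, 2))
```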

This work was inspired in part by the article,

D. Benvenuto, M. Giovanetti, L. Vassallo, S. Angeletti, M. Ciccozzi, “Application of the ARIMA model on the COVID-2019 epidemic dataset”, Data in Brief, 29 (2020), 105340.

Update, 2020-03-29, 00:24 EDT

Other recent work with R regarding the COVID-19 pandemic:

About ecoquant

See http://www.linkedin.com/in/deepdevelopment/ and https://667-per-cm.net/about