There are several sources of information regarding *Covid-19* incidence now available. This post uses data from the COVID Tracking Project for the United States, and a separate source for the United Kingdom. In particular I restrict attention to cumulative daily case counts for the United States, the UK, and the states of New York, Massachusetts, and Connecticut. The United States data are available here. The data for the United Kingdom are available here. I’m only considering reports through the 26th of March.

Please note that few of these models can be used to properly inform projections, and nothing here should be interpreted as medical advice for your personal care. Consult your physician for that. This is a scholarly investigation.

To begin, the table of cumulative counts of positive tests for coronavirus in the United States is:

date | positive | as fraction of positive and negative tests |
---|---|---|
20200326 | 80735 | 0.1555 |
20200325 | 63928 | 0.1517 |
20200324 | 51954 | 0.1507 |
20200323 | 42152 | 0.1508 |
20200322 | 31879 | 0.1415 |
20200321 | 23197 | 0.1295 |
20200320 | 17033 | 0.1260 |
20200319 | 11719 | 0.1162 |
20200318 | 7730 | 0.1045 |
20200317 | 5723 | 0.1073 |
20200316 | 4019 | 0.1002 |
20200315 | 3173 | 0.1233 |
20200314 | 2450 | 0.1253 |
20200313 | 1922 | 0.1237 |
20200312 | 1315 | 0.1406 |
20200311 | 1053 | 0.1478 |
20200310 | 778 | 0.1697 |
20200309 | 584 | 0.1478 |
20200308 | 417 | 0.1515 |
20200307 | 341 | 0.1586 |
20200306 | 223 | 0.1243 |
20200305 | 176 | 0.1559 |
20200304 | 118 | 0.1363 |

Now, the increase in positive tests is driven by the number of infections, but it is also heavily influenced by the amount of testing. The same source, however, offers the cumulative number of negative tests, so in the rightmost column I have expressed the positive test count as a fraction of the number of positive tests plus the number of negative tests.

If the increase in the number of positive tests is related to the increase in the prevalence of the virus, and since the infection diffuses through the population, then the increase ought to be related to the number of positive tests. As noted, the increase in the number of positive tests could also be related to the administration of additional tests, so, with only positive test data, the two effects are confounded. However, since the cumulative number of tests administered is available, we can assess how strongly the increase in positives is determined by the number of tests administered rather than by an expansion in the number of cases.
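As a small illustration of the two quantities being compared, the sketch below takes the last six days of the United States table, computes the daily increase in positives, and backs out an approximate cumulative test total from the fraction column. The variable names are mine, and the totals are only approximate because the fractions in the table are rounded to four decimal places.

```r
# Last six days of the United States table above, oldest first.
date     <- c("20200321", "20200322", "20200323",
              "20200324", "20200325", "20200326")
positive <- c(23197, 31879, 42152, 51954, 63928, 80735)
fraction <- c(0.1295, 0.1415, 0.1508, 0.1507, 0.1517, 0.1555)

# Approximate cumulative positive + negative tests, recovered from the
# rounded fraction column (an approximation, not the source's figures).
total.approx <- round(positive / fraction)

# Day-over-day increases in positives and in tests administered.
new.positive <- diff(positive)
new.tests    <- diff(total.approx)

print(data.frame(date = date[-1], new.positive, new.tests))
```

Both columns grow over the week, which is the confounding the regression below tries to untangle.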

Letting $Q_t$ denote the cumulative count of positive cases on day $t$, and $PN_t$ the cumulative count of positive plus negative tests, I’m interested in the difference relationship

$$Q_{t+1} - Q_t = a_Q Q_t + a_{PN} PN_t + \eta_t .$$

In another equivalent expression,

$$Q_{t+1} = (1 + a_Q) Q_t + a_{PN} PN_t + \eta_t ,$$

where $\eta_t$ denotes integral count noise.

This amounts to a linear regression on two covariates, with the resulting $a_Q$ and $a_{PN}$ indicating how strongly the increase in positive test counts is determined by the corresponding covariate. The left term is an AR(1) model. (See also.) Using **R**’s *lm* function results in:

```
> fit.usa.pn <- lm(D.usa ~ Q.usa[2:23] + PN.usa[2:23])
> summary(fit.usa.pn)

Call:
lm(formula = D.usa ~ Q.usa[2:23] + PN.usa[2:23])

Residuals:
        Min          1Q      Median          3Q         Max
-1411.26552  -491.51999  -172.32025   341.17292  1652.94005

Coefficients:
                    Estimate      Std. Error  t value  Pr(>|t|)
(Intercept)   717.8679340137  301.0981976893  2.38417  0.027702 *
Q.usa[2:23]     0.1981967029    0.0087058298 22.76597 2.986e-15 ***
PN.usa[2:23]   -0.0025874532    0.0016414650 -1.57631  0.131460
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 821.74248 on 19 degrees of freedom
Multiple R-squared:  0.97426944,  Adjusted R-squared:  0.97156096
F-statistic: 359.71083 on 2 and 19 DF,  p-value: 7.9298893e-16
```

Fitting a model without an intercept improves matters negligibly:

```
> fit.usa.noint <- lm(D.usa ~ Q.usa[1:22] + PN.usa[1:22] + 0)
> summary(fit.usa.noint)

Call:
lm(formula = D.usa ~ Q.usa[1:22] + PN.usa[1:22] + 0)

Residuals:
         Min           1Q       Median           3Q          Max
-1950.894882   -61.091628     8.041637   584.900171  2464.061157

Coefficients:
                   Estimate    Std. Error  t value   Pr(>|t|)
Q.usa[1:22]   0.26801699520 0.01121341634 23.90146 3.5088e-16 ***
PN.usa[1:22]  0.00018947235 0.00132300092  0.14321    0.88755
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1158.4248 on 20 degrees of freedom
Multiple R-squared:  0.96619952,  Adjusted R-squared:  0.96281947
F-statistic: 285.8538 on 2 and 20 DF,  p-value: 1.946384e-15
```
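For readers who want to try this themselves, here is a minimal sketch of how an intercept-free fit of this kind could be set up from the table of counts alone. It assumes the series is held in chronological order, and it reconstructs the cumulative test totals from the rounded fraction column, so the estimates should land close to, but not exactly on, the figures above. The names `Q.prev` and `PN.prev` are mine, not those used in the original session.

```r
# Cumulative positives, 2020-03-04 through 2020-03-26, oldest first.
Q.usa <- c(118, 176, 223, 341, 417, 584, 778, 1053, 1315, 1922, 2450,
           3173, 4019, 5723, 7730, 11719, 17033, 23197, 31879, 42152,
           51954, 63928, 80735)

# Positives as a fraction of positive + negative tests, same order.
frac.usa <- c(0.1363, 0.1559, 0.1243, 0.1586, 0.1515, 0.1478, 0.1697,
              0.1478, 0.1406, 0.1237, 0.1253, 0.1233, 0.1002, 0.1073,
              0.1045, 0.1162, 0.1260, 0.1295, 0.1415, 0.1508, 0.1507,
              0.1517, 0.1555)

# Approximate cumulative test totals (assumption: recovered from the
# rounded fractions rather than taken from the source data).
PN.usa <- round(Q.usa / frac.usa)

# Daily increase, regressed on the previous day's cumulative counts.
D.usa   <- diff(Q.usa)
n       <- length(Q.usa)
Q.prev  <- Q.usa[1:(n - 1)]
PN.prev <- PN.usa[1:(n - 1)]

fit <- lm(D.usa ~ Q.prev + PN.prev + 0)   # "+ 0" drops the intercept
s   <- summary(fit)
est <- coef(s)                            # estimates, std. errors, t, p

# The three numbers of most interest for the covariate Q.
print(round(c(coef.Q  = est["Q.prev", "Estimate"],
              se.Q    = est["Q.prev", "Std. Error"],
              adj.R2  = s$adj.r.squared), 3))
```

Pulling the estimate, its standard error, and the adjusted R-squared out of the `summary` object this way makes it straightforward to tabulate the same three numbers for several regions.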

Note that the dependence upon the total number of tests is weak. Plotting the daily increases in positive tests against the cumulative number of positive tests on the previous day, with the intercept-free line superimposed, gives:


Below is the same analysis applied to New York State:

```
> fit.ny.noint <- lm(D.ny ~ Q.ny[1:22] + PN.ny[1:22] + 0)
> summary(fit.ny.noint)

Call:
lm(formula = D.ny ~ Q.ny[1:22] + PN.ny[1:22] + 0)

Residuals:
         Min           1Q       Median           3Q          Max
-1208.249960   -17.977446     3.489550   450.488558  2247.967816

Coefficients:
                  Estimate    Std. Error  t value   Pr(>|t|)
Q.ny[1:22]   0.24758294065 0.01993668614 12.41846 7.4014e-11 ***
PN.ny[1:22]  0.00027030184 0.00452203747  0.05977    0.95293
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 988.67859 on 20 degrees of freedom
Multiple R-squared:  0.88521556,  Adjusted R-squared:  0.87373712
F-statistic: 77.119823 on 2 and 20 DF,  p-value: 3.9703623e-10
```

Below is the same analysis applied to Massachusetts:

```
> fit.ma.noint <- lm(D.ma ~ Q.ma[1:22] + PN.ma[1:22] + 0)
> summary(fit.ma.noint)

Call:
lm(formula = D.ma ~ Q.ma[1:22] + PN.ma[1:22] + 0)

Residuals:
        Min          1Q      Median          3Q         Max
-24.4349493  -7.8075546  -0.8168569   8.3116535  42.0791554

Coefficients:
                  Estimate    Std. Error t value  Pr(>|t|)
Q.ma[1:22]   0.17585256284 0.04856188611 3.62121 0.0035068 **
PN.ma[1:22]  0.00032421632 0.00052756343 0.61455 0.5503242
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 18.978648 on 12 degrees of freedom
  (8 observations deleted due to missingness)
Multiple R-squared:  0.54502431,  Adjusted R-squared:  0.46919503
F-statistic: 7.1875178 on 2 and 12 DF,  p-value: 0.0088701128
```

Below is the same analysis applied to Connecticut:

```
> fit.ct.noint <- lm(D.ct ~ Q.ct[1:19] + PN.ct[1:19] + 0)
> summary(fit.ct.noint)

Call:
lm(formula = D.ct ~ Q.ct[1:19] + PN.ct[1:19] + 0)

Residuals:
         Min           1Q       Median           3Q          Max
-117.0797164   -1.3840086    1.2688491   11.6339214  127.2275604

Coefficients:
                  Estimate    Std. Error t value   Pr(>|t|)
Q.ct[1:19]   0.29036730663 0.04573094922 6.34947 7.2613e-06 ***
PN.ct[1:19]  0.00027743517 0.00461747631 0.06008    0.95279
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 55.280014 on 17 degrees of freedom
Multiple R-squared:  0.70352934,  Adjusted R-squared:  0.66865044
F-statistic: 20.170628 on 2 and 17 DF,  p-value: 3.2497097e-05
```

And, finally, below is the same analysis applied to the United Kingdom:

```
> fit.uk.noint <- lm(D.uk ~ Q.uk[1:29] + 0)
> summary(fit.uk.noint)

Call:
lm(formula = D.uk ~ Q.uk[1:29] + 0)

Residuals:
        Min          1Q      Median          3Q         Max
-423.525900   -8.490398    5.891371   37.652952  352.355823

Coefficients:
               Estimate   Std. Error  t value   Pr(>|t|)
Q.uk[1:29] 0.2174876924 0.0074761256 29.09096 < 2.22e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 156.16753 on 28 degrees of freedom
Multiple R-squared:  0.9679738,  Adjusted R-squared:  0.96683
F-statistic: 846.28412 on 1 and 28 DF,  p-value: < 2.22045e-16
```

Recall that the UK data come from a different source, which does not provide the total number of tests performed. Consequently, I could not check whether the dependence of increases in positive tests upon the number of tests was weak in the UK's case. Also, the UK’s testing procedure and its biochemistry probably differ from those in the United States, although there is no assurance that tests performed from state to state are strictly exchangeable either.

The plot for the UK is:

Summarizing the no-intercept results:

Country/State | coefficient on cumulative positives | standard error in coefficient | adjusted $R^2$ |
---|---|---|---|
United States | 0.268 | 0.011 | 0.963 |
New York | 0.248 | 0.020 | 0.874 |
Massachusetts | 0.176 | 0.049 | 0.469 |
Connecticut | 0.290 | 0.046 | 0.669 |
United Kingdom | 0.217 | 0.007 | 0.967 |

As noted above, the United Kingdom results are not strictly comparable for the reasons given. The conclusion is that the AR(1) term dominates and, at least at the country level, appears predictive. The low adjusted $R^2$ values for Massachusetts and Connecticut may be because of low counts, or the relative youth of the epidemic there. Thus, my interpretation is that the increase in positive case count is driven by the disease, not by an acceleration in testing. This disagrees with an implication I made in an earlier blog post.
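To make "predictive" a little more concrete, here is a rough one-step check using only numbers already shown in this post: the United States coefficient of 0.268 and the cumulative positives for 25 and 26 March. It ignores the tests covariate, whose coefficient is close to zero, and it is an in-sample check, since 26 March was part of the fitted data. The variable names are mine.

```r
a.Q      <- 0.268   # United States coefficient, from the summary table
Q.prev   <- 63928   # cumulative positives on 20200325
Q.actual <- 80735   # cumulative positives on 20200326

# AR(1) form: next-day cumulative count is (1 + a.Q) times today's.
pred.increase <- a.Q * Q.prev
pred.Q        <- (1 + a.Q) * Q.prev

# Signed relative error of the one-step prediction.
rel.error <- (pred.Q - Q.actual) / Q.actual

print(c(pred.increase = pred.increase,
        actual.increase = Q.actual - Q.prev,
        rel.error = rel.error))
```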

This work was inspired in part by the article,

D. Benvenuto, M. Giovanetti, L. Vassallo, S. Angeletti, M. Ciccozzi, “Application of the ARIMA model on the COVID-2019 epidemic dataset”, *Data in Brief*, **29** (2020), 105340.

*Update*, 2020-03-29, 00:24 EDT

Other recent work with **R** regarding the COVID-19 pandemic: