Modeling even simple aspects of the covid-19 pandemic is more than challenging in the United States because of the sparsity of data. There is no comprehensive testing and little is known about the efficacy of testing. Adding to the uncertainty, we not only lack information on the total numbers infected, but we do not know if those who have been infected can be re-infected, the proportion of those who are infected who are asymptomatic, and when and how long they may infect others. In short, you have a forecaster’s nightmare.
The “sparse data” issue is compounded because most, if not all, epidemiological models are complex. They are designed to provide a lot of information needed in the face of a pandemic. Making this nightmare even worse is the fact that while at least some data may be available to use at the national level, applying a typical epidemiological model to a subnational area such as a county is virtually impossible without having to endure a very heavy “assumption burden.” All of these issues leave local officials and residents in the dark when it comes to trying to get a picture of what may be coming and how to prepare for it. And it is in these small, local areas that many battles are being fought.
However, all is not lost for local areas. I have been producing county-level forecasts that employ simple methods and concepts that need no more than the sparse data available in the U.S. With a simple geometric model (the same one used to calculate compound interest) using only the cumulative daily count of confirmed cases, and the “impact analysis” framework, I have developed forecasts for Whatcom County, Washington. The “impact analysis” framework looks at how an event might unfold if it were left to run its course relative to interventions designed to alter that course. The baseline forecast is intended to provide a picture of the future in the absence of the “intervention” of interest. In our case, this is the introduction of containment measures designed to slow the spread of the virus. This framework is not perfect in that it is not a controlled experiment. However, to paraphrase George E. P. Box, the highly influential statistician, while this model is “wrong” because like all models it is an approximation to reality, it appears to be useful.
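The compound-interest form of the geometric model can be sketched in a few lines. This is a minimal illustration, not the code behind the Whatcom County forecasts: the function names are made up for this example and the counts are hypothetical. It derives a daily growth factor from cumulative counts at two time points and then carries that factor forward.

```python
def geometric_rate(count_start, count_end, days_between):
    """Daily growth factor implied by two cumulative counts (compound-interest formula)."""
    return (count_end / count_start) ** (1.0 / days_between)


def geometric_forecast(count_launch, daily_factor, days_ahead):
    """Project a cumulative count forward at a constant daily growth factor."""
    return count_launch * daily_factor ** days_ahead


# Hypothetical cumulative confirmed counts, NOT the Whatcom County data:
# 50 cases five days before launch, 100 cases on the launch date.
rate = geometric_rate(count_start=50, count_end=100, days_between=5)
projected = geometric_forecast(count_launch=100, daily_factor=rate, days_ahead=10)
print(f"daily factor: {rate:.4f}, projected cases in 10 days: {projected:.0f}")
```

Because the factor is held constant, the projection keeps compounding; that is what makes it a baseline for the "no intervention" path rather than a picture of the whole epidemic.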
The idea underlying my work is similar to the idea underlying the “barefoot doctors,” a movement started in the 1930s in which personnel were trained in basic public health and medicine and dispatched to rural villages in China. This movement was designed to transfer basic knowledge and practices to areas of China that had suffered more than their share of epidemics and related health problems. By all accounts, it was highly successful and only ended in the 1980s as China moved to embrace capitalism. So, following this idea, one might call the approach I have taken with Whatcom County “Barefoot Demography.”
The initial forecast was a “baseline” that launched from March 28th, showing what the county could expect in the absence of an “intervention,” which in this case was containment measures. As of the date of the expected peak of the initial surge, April 25th, the baseline showed that 6,151 confirmed cases were expected.
About a week after the baseline forecast was released, I followed with the first update. The update used data that reflected the initial effects of containment measures put in place by the governor of Washington, Jay Inslee, on March 25th. Like the baseline, it was based on a simple geometric model. As of the date of the expected peak of the initial surge, April 25th, this first update showed that 2,696 confirmed cases were expected. By reducing the initial rate of growth by 2.47 percent in less than a week (from a daily rate of increase of 1.1584 to 1.1298), the containment measures led to a 56 percent reduction in the cumulative number of expected cases by April 25th. These results showed the local officials and the general public that the sacrifices made by the many people who strove to adhere to the containment measures, which included foregoing work and income, were paying a dividend in cases averted and lives saved.
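The two percentages quoted above follow directly from the figures in the text. The short check below uses only those figures; the variable names are mine.

```python
baseline_factor = 1.1584  # daily rate of increase, baseline forecast
update_factor = 1.1298    # daily rate of increase, first update
baseline_cases = 6151     # expected confirmed cases on April 25th, baseline
update_cases = 2696       # expected confirmed cases on April 25th, first update

# Relative change in the daily rate of increase.
rate_change_pct = (update_factor / baseline_factor - 1.0) * 100.0

# Relative reduction in expected cumulative cases on April 25th.
case_reduction_pct = (1.0 - update_cases / baseline_cases) * 100.0

print(f"change in daily rate: {rate_change_pct:.2f} percent")       # -2.47
print(f"reduction in expected cases: {case_reduction_pct:.0f} percent")  # 56
```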
The second update continued with more good news in terms of the reduction of cases being brought about from the containment measures, an 82 percent reduction in the total number of confirmed cases relative to the baseline as of April 25th. In addition to providing the 2nd update in terms of the simple geometric model I had been using, I turned to a more complex model, exponential in nature, to generate a forecast because there were now 17 days of data available. To implement this model, I used the exponential model function found under the “curve fitting” choices in the NCSS statistical software.
Besides using all of the information available, I employed the exponential model to assess the adequacy of the geometric model and its results. This is a move made possible by the availability of more data. Keep in mind that the geometric model is based on observations taken at two time points. The exponential model I employed uses information from all of the available points in time for which observations are available. In the case of this evaluation, the daily rate of change calculated by the exponential model is less than that used in the geometric model. This suggests that the latter’s forecast will be on the “high side.”
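The difference between a two-point rate and a rate that uses every observation can be illustrated as follows. This is a sketch under stated assumptions: the counts are hypothetical, the two points are taken to be the first and last observations, and the "all points" fit is an ordinary least-squares line through the logarithms of the counts, which is a common stand-in and not necessarily the estimation routine NCSS uses.

```python
import math


def two_point_factor(counts):
    """Daily growth factor from the first and last cumulative counts only."""
    days = len(counts) - 1
    return (counts[-1] / counts[0]) ** (1.0 / days)


def fitted_factor(counts):
    """Daily growth factor from a least-squares line through log(counts), using every day."""
    n = len(counts)
    xs = list(range(n))
    ys = [math.log(c) for c in counts]
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    return math.exp(sxy / sxx)


# Hypothetical daily cumulative counts, NOT the Whatcom County series.
counts = [10, 13, 16, 19, 22, 30]
geo = two_point_factor(counts)
fit = fitted_factor(counts)
print(f"two-point factor: {geo:.4f}, all-points factor: {fit:.4f}")
```

In this made-up series the last observation jumps, so the two-point factor comes out higher than the fitted one. Which of the two is larger depends on the data; the point is only that the two-point rate is sensitive to the endpoints chosen.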
The exponential model yielded a forecast of 961 total confirmed covid-19 cases as of April 25th, which was only 157 cases fewer than that forecasted by the simple geometric model (a relative difference of -16.3 percent, taking the exponential forecast as the base). Looking at the 95 percent prediction interval accompanying the exponential model, I found that the result of the simple geometric model (1,118) fell well within the interval, which had 715 cases as the lower limit and 1,207 cases as the upper limit. Even though, as expected, the geometric model produced forecasts on the “high side,” the results were not outside the prediction interval limits of the exponential model. All in all, these results suggested that the simple geometric model had done a reasonable job with the sparse data available.
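The comparison above can be reproduced from the reported numbers alone. The check below uses only figures given in the article (the variable names are mine) and also recovers the 82 percent reduction relative to the baseline of 6,151 cases.

```python
baseline_cases = 6151         # baseline forecast for April 25th
geometric_cases = 1118        # second update, simple geometric model
exponential_cases = 961       # second update, exponential model
lower_limit, upper_limit = 715, 1207  # 95 percent prediction interval, exponential model

difference = exponential_cases - geometric_cases
relative_pct = difference / exponential_cases * 100.0
inside_interval = lower_limit <= geometric_cases <= upper_limit
reduction_vs_baseline_pct = (1.0 - geometric_cases / baseline_cases) * 100.0

print(difference)                          # -157
print(f"{relative_pct:.1f}")               # -16.3
print(inside_interval)                     # True
print(f"{reduction_vs_baseline_pct:.0f}")  # 82
```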
On April 17th, the third update was published. This one used a three-parameter logistic model, which was selected from the curve fitting choices offered by the NCSS Statistical Software System. This model was selected because it was clear the initial explosive growth indicated by the baseline forecast had been brought sufficiently under control by the containment measures that the surge was near its peak and on the verge of plateauing. In addition, there was a sufficient set of observations to support this model. As discussed earlier in regard to the exponential and geometric models, the daily rate of change underlying a logistic model will be less than the exponential model’s even when the same data are used – the former is used when the peak is near, which by definition in regard to a pandemic means that there is evidence that the daily rate of change in confirmed cases has declined.
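The declining daily rate of change under a logistic curve can be seen in a small sketch. The form used here, Y = ceiling / (1 + a*exp(-b*t)), is one common three-parameter logistic; it is assumed for illustration and may not match the exact parameterization in NCSS. The parameter values are hypothetical.

```python
import math


def logistic(t, ceiling, a, b):
    """Three-parameter logistic curve: cumulative cases level off at `ceiling`."""
    return ceiling / (1.0 + a * math.exp(-b * t))


# Hypothetical parameters, NOT the fitted Whatcom County values.
ceiling, a, b = 300.0, 50.0, 0.25

# Ratio of each day's cumulative count to the previous day's.
daily_ratios = [
    logistic(t + 1, ceiling, a, b) / logistic(t, ceiling, a, b)
    for t in range(40)
]

for t in (0, 10, 20, 39):
    print(f"day {t:2d}: daily ratio {daily_ratios[t]:.4f}")
```

Early on the ratio is close to exp(b), the rate a pure exponential with the same b would hold forever; it then falls steadily toward 1 as the curve approaches its ceiling, which is the plateau behavior described above.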
Like the earlier updates, this third update brought welcome news, particularly in light of the sacrifices made by the many people who strove to adhere to the containment measures, which included foregoing work and income. These sacrifices paid a huge dividend in cases averted and lives saved. Per the third update, by April 25th these sacrifices were expected to bring about a 95 percent reduction in the initial expected number of confirmed cases as shown in the baseline forecast. I noted that it was a tremendous achievement that had done a lot to reduce the risk to the first responders, healthcare, grocery, and other workers who had put themselves at higher levels of risk by staying at essential jobs.
In Summary
What is my take-away from this experience to date? Namely, that when employed within an “impact analysis” framework, simple models selected with reasonable judgment (use a geometric model in the initial stages of a surge, not a linear model) can provide a reasonable view of the future with only sparse data to support them. The primary force driving this idea is that with the information provided by such models, framed by the impact analysis perspective, people living in counties and small towns will have some idea of what they might be facing, rather than remaining in the dark. As more data become available, these simple models can be replaced with more complex ones, including those designed to provide more information such as R0 (the basic reproduction number), case fatality rates, and hospitalization rates.
The major limit on the information provided by the geometric, exponential, and logistic models is that all of the observations are “confirmed cases,” which means the forecasts themselves provide only a picture of future confirmed cases. However, they could be extended so that they provide a picture not only of confirmed cases, but also of the numbers who are “positive” but unconfirmed. One way in which this can be done is to apply the results of the estimation method developed by Ron Cossman and me for this purpose to the forecasted numbers of confirmed cases to get an estimate of the positive, unconfirmed cases. This is relatively simple. This could be further supplemented by developing rates of change in the positive, unconfirmed population. Again, even this step would be relatively simple.
Before closing, I note that Kesten Green and J. Scott Armstrong, the latter arguably one of the most highly respected forecasters in the world, found no evidence that complex models are more accurate than simple ones. It is worthwhile to consider the points made by Green and Armstrong in conjunction with the view underlying Leo Breiman’s “Two Cultures” description of the field of statistics.
Breiman placed himself in what he called the “algorithmic modeling culture,” where solving the problem at hand is more important than adhering to traditional methods and protocols. The other culture was labeled by Breiman as the “data modeling culture.” By this, he means those who follow the theoretical specifications as found in the literature and have unquestioning adherence to established methodologies. He likened this to a religious cult. Breiman’s own culture, by contrast, is made up of those more inclined to follow nature’s mechanisms than the mechanisms specified by the data modeling culture.
I note that complex models (such as the one developed for the COVID-19 pandemic at the Institute for Health Metrics and Evaluation in Seattle) clearly supply more information than does the simple geometric model, but this is of little use if the models are “assumption heavy” and lack sufficient data to implement them.
Similarly, as Tom Burch (my retired demographic colleague with a courtesy appointment at the University of Victoria, in British Columbia) has pointed out to me, we can live without perfect accuracy, as long as the information being provided by a simple method is reasonable, because this is better than knowing nothing about the future. However, if the information is not reasonable, but instead misleading, then we have a different situation. This can be taken as advice on the selection of models and the data used to operationalize them: do not rush into this process. It requires some thought and, when more data become available, evaluation.
It is appropriate to close with a statement about the scientific method, which briefly is to develop a hypothesis, gather data and select an appropriate method to study it, and revise the hypothesis as needed. This is an ongoing process that also involves thinking probabilistically rather than deterministically (see here and here).
Comments by Readers
Michael Riordan
Apr 24, 2020
As I mentioned to John Servais at the time, but did not enter a comment about it, the problem with simple geometric models based on exponential growth is that they will never turn over, which is what we now see happening in a few states like WA and NY, if not nationally. A straightforward exponential function will always increase exponentially, no matter what the daily increase chosen. Ultimately the more complex models are needed to try to gauge reality, despite the greater uncertainty in parameters and assumptions.
David A. Swanson
Apr 24, 2020
Did I fail to mention that, for obvious reasons, neither the geometric nor the exponential models are designed to capture the entire course of an epidemic, only the path up to the estimated date of the peak? When the peak appears to be near, it is time to look at a logistic model to get an idea of what may lie slightly beyond a peak in terms of a plateau period, a path I took with Whatcom County’s third update. To look at what the downhill side might bring, one has several choices, including a ratio of second-order polynomials, Y = (a + bx + cx^2)/(1 + dx + ex^2), where a, b, c, d, and e are parameters to be estimated. Another possibility is a modified exponential model, Y = a*(x^b)*exp(-cx). There are more models that can do this, and to the point below, especially, the epidemiological ones.
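The two functional forms just mentioned can both rise and then turn over, which a plain exponential cannot. The sketch below evaluates each over a grid of days and locates its peak. All parameter values are hypothetical and chosen only so that each curve has a visible peak; none are fitted to any county's data.

```python
import math


def rational_quadratic(x, a, b, c, d, e):
    """Ratio of second-order polynomials: (a + bx + cx^2) / (1 + dx + ex^2)."""
    return (a + b * x + c * x ** 2) / (1.0 + d * x + e * x ** 2)


def modified_exponential(x, a, b, c):
    """Modified exponential a * x^b * exp(-c*x); for b, c > 0 it peaks at x = b / c."""
    return a * (x ** b) * math.exp(-c * x)


days = range(0, 101)

# Hypothetical parameters for illustration only.
rational_values = [rational_quadratic(x, 1.0, 20.0, 0.5, 0.0, 0.01) for x in days]
modexp_values = [modified_exponential(x, 5.0, 3.0, 0.1) for x in days]

peak_day_rational = max(days, key=lambda x: rational_values[x])
peak_day_modexp = max(days, key=lambda x: modexp_values[x])

print(f"rational form peaks near day {peak_day_rational}")
print(f"modified exponential peaks near day {peak_day_modexp}")
```

With these values the rational form settles toward c/e after its peak, while the modified exponential decays toward zero, so the choice between them depends on whether a nonzero plateau is expected on the downhill side.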
As stated in the article, the driving force for my work is the fact that so little, if anything, was known about the path to the estimated peak of the initial surge in local areas. The epidemiological models were noticeably absent, not only in regard to Whatcom County, but virtually every other county that was not in a heavily populated urban area (as well as more than a few of the latter). I never expected (nor do I now) to provide any post peak forecasts in regard to the initial surge for two reasons: (1) the epidemiological and other complicated models should have sufficient data to operate around the time the peak of the initial surge is reached; and (2) the urgency is not on the downhill slope, it is on the uphill slope. Because I anticipate that the epidemiological and other complicated models (reason 1) will be running effectively sometime after the initial surge peaks, I do not anticipate having to use simple models to forecast the paths of subsequent surges to their peaks. Hence, I see my work as being largely done in terms of forecasts. The time is coming for those with the experience needed to run epidemiological models to step in. Hopefully, the data needed to do this will finally come online, including the testing, the tracking, the correct coding of deaths, and so on.
Steve Harris
Apr 24, 2020
David,
Looking strictly at his scientific method (e.g. regression modeling), is there any credibility in his conclusions?
https://www.spiked-online.com/2020/04/22/there-is-no-empirical-evidence-for-these-lockdowns/?fbclid=IwAR34bPaSsHM5zLFYV_sHqSOf9y8vkaewojrchN67_EY4xAwclE1m_zm9aLU
Steve
David A. Swanson
Apr 24, 2020
Simple regression-like models work well for places that started with a low number of cases and never even threatened to explode. Kittitas County (CWU) and Whitman County (WSU) are two cases in point. Before the virus took hold, classes were cancelled, dorms largely closed and students went home. Unlike WWU and Whatcom County, CWU and WSU really dominate their respective counties and the data to date for them show that a linear model can be used. In fact, I have been forecasting these two counties using the ARIMA model (Auto-Regressive Integrated Moving Average), which at its core is a regression model. Same with Nantucket Island off the coast of Cape Cod, MA. The ARIMA approach works well for these cases where the “containment intervention” was drastic (go home!) and dominated the whole of the county. It may also work well in a sparsely populated, isolated place (such as Garfield County, WA). However, as the Faroe Islands have shown, once the virus gains entry, the cases will explode in the absence of containment measures. The Faroes learned this lesson the hard way and similarly, so did Iceland. So, although Garfield as of today is the only county in Washington without a confirmed case, it may end up like the Faroes, in the throes of an outbreak of cases it will struggle to control.
When it comes to counties like Benton, Grant, and Whatcom in WA (counties for which I have developed baseline and updated forecasts - not reported for Grant County), a regression (linear) approach does not work well in the early stages of an initial surge because as these counties have shown, the outbreak had the potential to be explosive (exponential) in the absence of containment measures.
This means that in such a situation a regression approach will understate the number of cases as one goes into the future up to the peak. Moreover, it is clear to me that using a linear approach in the early stages of the initial surge would not have worked well in other counties I have been forecasting, including Clark County, NV (Las Vegas), San Luis Obispo County, CA (which is similar to Whatcom in that Cal Poly U is there, but like WWU, does not dominate the county), Shelby County, TN (Memphis) and Norfolk County, VA (the huge naval base is located there). Like Benton and Whatcom counties, they all showed the potential for explosive growth in the early stages of their initial surges and linear models were not the ones to use with them at that point in time.
As an aside, the ARIMA forecasts I have done for Nantucket provide prediction intervals. I am working with a local colleague on Nantucket and we are advising the local health authorities to keep a close eye on reported cases relative to the upper limit of the interval. It is highly likely that a lot of summer visitors will start showing up even in the face of this pandemic and our advice was that if reported cases start tending above the ARIMA forecast numbers as these visitors come (a yellow warning light that explosive, exponential growth may occur) and then hit the upper prediction limit (a red danger light), they should implement their plans for containment, with “pre-warnings” to all as the yellow light starts flashing. These same cautions would be appropriate for Kittitas and Whitman counties when students return.
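The yellow-light and red-light rule can be written down as a simple check. This is a simplified sketch: it compares a single day's reported count with that day's point forecast and upper prediction limit, whereas "tending above" in practice would be judged over several days. The function name and all numbers are hypothetical.

```python
def warning_light(reported_cases, point_forecast, upper_limit):
    """Classify a reported count against a forecast and its upper prediction limit."""
    if reported_cases >= upper_limit:
        return "red"     # at or beyond the upper prediction limit: implement containment
    if reported_cases > point_forecast:
        return "yellow"  # above the point forecast: issue pre-warnings
    return "green"       # at or below the point forecast


# Hypothetical values, NOT the Nantucket forecasts.
print(warning_light(reported_cases=12, point_forecast=14, upper_limit=20))  # green
print(warning_light(reported_cases=16, point_forecast=14, upper_limit=20))  # yellow
print(warning_light(reported_cases=21, point_forecast=14, upper_limit=20))  # red
```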
The close relationship that all regression-based procedures have with inferential statistics is powerful in that one can use “interval estimates” (prediction intervals) in addition to “point estimates.” This is a feature found in many non-linear models as well, including the exponential model I selected for the second update of Whatcom County that I also used to evaluate the simple (compound interest formula) geometric model.
Thanks for your question. I hope that this helped to answer it.
Steve Harris
Apr 25, 2020
Thank you for the response.