Modeling even simple aspects of the COVID-19 pandemic in the United States is more than challenging because the data are sparse. There is no comprehensive testing, and little is known about the efficacy of the testing that is done. Adding to the uncertainty, we not only lack information on the total number infected, but we also do not know whether those who have been infected can be re-infected, what proportion of the infected are asymptomatic, or when and for how long they may infect others. In short, it is a forecaster’s nightmare.
The sparse-data issue is compounded because most, if not all, epidemiological models are complex. They are designed to provide a great deal of the information needed in the face of a pandemic. Making this nightmare even worse, while at least some usable data may be available at the national level, applying a typical epidemiological model to a subnational area such as a county is virtually impossible without taking on a very heavy “assumption burden.” All of these issues leave local officials and residents in the dark when it comes to getting a picture of what may be coming and how to prepare for it. And it is in these small, local areas that many battles against the virus are being fought.
However, all is not lost for local areas. I have been producing county-level forecasts that employ simple methods and concepts and need no more than the sparse data available in the U.S. Using a simple geometric model (the same one used to calculate compound interest), only the cumulative daily count of confirmed cases, and an “impact analysis” framework, I have developed forecasts for Whatcom County, Washington. The impact analysis framework compares how an event would unfold if left to run its course against how it unfolds under interventions designed to alter that course. The baseline forecast is intended to provide a picture of the future in the absence of the “intervention” of interest, which in our case is the introduction of containment measures designed to slow the spread of the virus. This framework is not perfect in that it is not a controlled experiment. However, to paraphrase the highly influential statistician George E. P. Box, while this model is “wrong” because, like all models, it is an approximation to reality, it appears to be useful.
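As a sketch, the geometric model is just the compound-interest formula applied to case counts; the starting count, growth rate, and horizon below are illustrative values, not the Whatcom County data.

```python
# Geometric (compound-interest) growth: cases(t) = cases(0) * r ** t,
# where r is the daily growth factor. All numbers here are illustrative.

def geometric_forecast(c0, daily_rate, days):
    """Project the cumulative confirmed-case count `days` ahead."""
    return c0 * daily_rate ** days

def implied_daily_rate(c_start, c_end, days):
    """Back out the daily growth factor from counts at two time points."""
    return (c_end / c_start) ** (1 / days)

# 100 cumulative cases growing 15 percent per day for 14 days:
print(round(geometric_forecast(100, 1.15, 14)))    # -> 708
print(round(implied_daily_rate(100, 708, 14), 3))  # -> 1.15
```

Note that only two observations are needed to fix the growth factor, which is what makes the model usable when data are sparse.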
The idea underlying my work is similar to the idea underlying the “barefoot doctors,” a movement started in the 1930s in which personnel trained in basic public health and medicine were dispatched to rural villages in China. The movement was designed to transfer basic knowledge and practices to areas of China that had suffered more than their share of epidemics and related health problems. By all accounts it was highly successful, and it ended only in the 1980s as China moved to embrace capitalism. Following this idea, one might call the approach I have taken with Whatcom County “Barefoot Demography.”
The initial forecast was a “baseline” launched from March 28th, showing what the county could expect in the absence of an “intervention,” which in this case meant containment measures. As of April 25th, the expected peak date of the initial surge, the baseline showed 6,151 expected confirmed cases.
About a week after the baseline forecast was released, I followed with the first update. The update used data that reflected the initial effects of the containment measures put in place by the governor of Washington, Jay Inslee, on March 25th. Like the baseline, it was based on a simple geometric model. As of April 25th, the expected peak date of the initial surge, this first update showed 2,696 expected confirmed cases. By reducing the initial rate of growth by 2.47 percent in less than a week (from a daily growth factor of 1.1584 to 1.1298), the containment measures led to a 56 percent reduction in the cumulative number of expected cases by April 25th. These results showed local officials and the general public that the sacrifices made by the many people who strove to adhere to the containment measures, which included foregoing work and income, were paying a dividend in cases averted and lives saved.
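The comparison between the baseline and the first update reduces to simple arithmetic on the figures quoted above; a minimal check:

```python
# Impact of containment, using the figures quoted in the text:
# daily growth factor 1.1584 -> 1.1298, April 25th forecast 6,151 -> 2,696.

baseline_rate, updated_rate = 1.1584, 1.1298
baseline_cases, updated_cases = 6151, 2696

rate_cut = (baseline_rate - updated_rate) / baseline_rate * 100
case_cut = (baseline_cases - updated_cases) / baseline_cases * 100

print(f"growth-rate reduction: {rate_cut:.2f}%")     # -> 2.47%
print(f"case reduction by Apr 25: {case_cut:.0f}%")  # -> 56%
```

Because growth is compounded daily, a small cut in the daily rate translates into a much larger cut in cumulative cases by the peak date.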
The second update continued with more good news about the reduction in cases being brought about by the containment measures: an 82 percent reduction in the total number of confirmed cases expected by April 25th relative to the baseline. In addition to providing the second update with the simple geometric model I had been using, I turned to a more complex, exponential model to generate a forecast, because there were now 17 days of data available. To implement this model, I used the exponential function found under the “curve fitting” options in the NCSS statistical software.
Besides using all of the available information, I employed the exponential model to assess the adequacy of the geometric model and its results, a check made possible by the availability of more data. Keep in mind that the geometric model is based on observations taken at only two points in time, while the exponential model uses all of the points in time for which observations are available. In this evaluation, the daily rate of change calculated by the exponential model was less than the one used in the geometric model, suggesting that the geometric model’s forecast would be on the “high side.”
The exponential model yielded a forecast of 961 total confirmed COVID-19 cases as of April 25th, only 157 cases fewer than the 1,118 forecasted by the simple geometric model, a relative difference of -16.3 percent. Looking at the 95 percent prediction interval accompanying the exponential model, I found that the geometric model’s result fell well within it: the interval ran from a lower limit of 715 cases to an upper limit of 1,207 cases. Even though, as expected, the geometric model produced a forecast on the “high side,” it was not outside the prediction interval of the exponential model. All in all, these results suggested that the simple geometric model had done a reasonable job with the sparse data available.
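I do not reproduce the NCSS routine here, but an exponential fit of this kind can be sketched as least squares on the logged counts. The 17-day series below is synthetic and noiseless, for illustration only, not the actual county data.

```python
import numpy as np

# Fit y = a * exp(b * t) by least squares on log counts -- a stand-in
# sketch for a curve-fitting routine, applied to a synthetic series.
t = np.arange(17)
cases = 120 * 1.13 ** t                      # synthetic cumulative counts

b, log_a = np.polyfit(t, np.log(cases), 1)   # slope and intercept
daily_rate = float(np.exp(b))                # implied daily growth factor
a = float(np.exp(log_a))

forecast = a * daily_rate ** 25              # project to day 25
print(round(daily_rate, 4))                  # recovers 1.13 on clean data
```

Because every observation contributes to the fitted slope, the implied daily rate tends to sit below a rate taken from the two newest points when growth is already slowing.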
On April 17th, the third update was published. This one used a three-parameter logistic model, selected from the curve-fitting options offered by the NCSS statistical software. I chose this model because it was clear that the explosive initial growth indicated by the baseline forecast had been brought sufficiently under control by the containment measures that the surge was near its peak and on the verge of plateauing, and because there was now a sufficient set of observations to support it. As discussed earlier regarding the exponential and geometric models, the daily rate of change underlying a logistic model will be less than the exponential model’s even when the same data are used: the logistic model is used when the peak is near, which for a pandemic by definition means there is evidence that the daily rate of change in confirmed cases has declined.
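A three-parameter logistic curve of this kind can be sketched as follows; the parameter values and data are synthetic, not the county series, and `scipy` is assumed to be available in place of NCSS.

```python
import numpy as np
from scipy.optimize import curve_fit

# Three-parameter logistic: L is the plateau (total cases when the surge
# ends), k the growth rate, t0 the inflection (peak-growth) day.
def logistic(t, L, k, t0):
    return L / (1.0 + np.exp(-k * (t - t0)))

t = np.arange(30, dtype=float)
cases = logistic(t, 330.0, 0.25, 14.0)       # synthetic, noiseless series

popt, _ = curve_fit(logistic, t, cases, p0=[300.0, 0.2, 12.0])
L_hat, k_hat, t0_hat = popt
print(round(L_hat))                          # plateau estimate
```

The fitted plateau `L_hat` is the quantity of interest once a surge is levelling off: it estimates the total confirmed cases the surge will produce, something neither the geometric nor the exponential model can supply.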
Like the earlier updates, this third update brought welcome news, particularly in light of the sacrifices people had made to adhere to the containment measures, which included foregoing work and income. These sacrifices paid a huge dividend in cases averted and lives saved. Per the third update, by April 25th they were expected to bring about a 95 percent reduction in the number of confirmed cases expected under the baseline forecast. I noted that this was a tremendous achievement, one that had done a great deal to reduce the risk to the first responders, healthcare workers, grocery workers, and others who had put themselves at higher risk by staying at essential jobs.
What is my take-away from this experience to date? When employed within an “impact analysis” framework, simple models selected with reasonable judgment (use a geometric model in the initial stages of a surge, not a linear model) can provide a reasonable view of the future with only sparse data to support them. The point is that with the information provided by such models, framed by the impact analysis perspective, people living in counties and small towns will have some idea of what they might be facing rather than remaining in the dark. As more data become available, these simple models can be replaced with more complex ones, including those designed to provide more information such as R0 (the basic reproduction number), case fatality rates, and hospitalization rates.
The major limitation of the geometric, exponential, and logistic models is that all of the observations are “confirmed cases,” which means the forecasts themselves provide only a picture of future confirmed cases. However, they could be extended to provide a picture not only of confirmed cases but also of the number who are “positive” but unconfirmed. One way to do this is to apply the results of the estimation method Ron Cossman and I developed for this purpose to the forecasted numbers of confirmed cases, yielding an estimate of the positive, unconfirmed cases. This is relatively simple. It could be further supplemented by developing rates of change in the positive, unconfirmed population, which again would be relatively simple.
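A sketch of that extension: scale the forecast confirmed count by an assumed ratio of total positives to confirmed cases. The ratio of 5.0 below is an illustrative placeholder, not the value produced by the estimation method mentioned above.

```python
# Hypothetical ratio of total positives (confirmed + unconfirmed) to
# confirmed cases; 5.0 is an illustrative placeholder, not an estimate.
RATIO = 5.0

def total_positives(confirmed_forecast, ratio=RATIO):
    """Estimated confirmed plus unconfirmed positives."""
    return confirmed_forecast * ratio

def unconfirmed_positives(confirmed_forecast, ratio=RATIO):
    """Estimated positives that the confirmed count misses."""
    return total_positives(confirmed_forecast, ratio) - confirmed_forecast

# Applied to a forecast of 961 confirmed cases:
print(unconfirmed_positives(961))            # -> 3844.0
```

Since the scaling is a simple multiplier, any forecast of confirmed cases, geometric, exponential, or logistic, can be extended this way without refitting the underlying model.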
Before closing, I note that Kesten Green and J. Scott Armstrong, the latter arguably one of the most highly respected forecasters in the world, found no evidence that complex models are more accurate than simple ones. It is worthwhile to consider the points made by Green and Armstrong in conjunction with the view underlying Leo Breiman’s “two cultures” description of the field of statistics.
Breiman placed himself in what he called the problem-solving culture, in which solving the problem at hand is more important than adhering to traditional methods and protocols, and whose members are more inclined to follow nature’s mechanisms than the specifications handed down in the literature. The other culture Breiman labeled the “data modelling culture”: those who follow the theoretical specifications found in the literature and adhere unquestioningly to established methodologies, a stance he likened to a religious cult.
I note that complex models (such as the one developed for the COVID-19 pandemic at the Institute for Health Metrics and Evaluation in Seattle) clearly supply more information than the simple geometric model does, but this is of little use if the models lack sufficient data to implement and are “assumption heavy.”
Similarly, as Tom Burch (my retired demographer colleague, who holds a courtesy appointment at the University of Victoria in British Columbia) has pointed out to me, we can live without perfect accuracy as long as the information provided by a simple method is reasonable, because that is better than knowing nothing about the future. If the information is not reasonable but misleading, however, then we have a different situation. This can be taken as advice on the selection of models and the data used to operationalize them: do not rush the process; it requires some thought and, when more data become available, evaluation.
It is appropriate to close with a statement about the scientific method, which, briefly, is to develop a hypothesis, gather data and select an appropriate method to study it, and revise the hypothesis as needed. This is an on-going process, one that also involves thinking probabilistically rather than deterministically (see here and here).