COVID19 update, April 10, 2020: all models are wrong, but some are useful.

“All models are wrong, but some are useful.” Thus spake one of the leading lights of statistics in the 20th century, George E. P. Box FRS.

Models can be useful, however, if you remember that a map is not the territory, a representation is not an object, and a model is not reality. Sadly, the distinction between a theory and a model is lost on most people who are not scientists themselves (and, alas, on some people who call themselves scientists).

We hear a lot in the media about how pessimistic predictions of some modelers later had to be revised downward by nearly two orders of magnitude. Lots of snickering, for sure, but understand the incentive structure here. If you ask a modeler, “just how bad can this get?” and she gives you her worst-case estimate, and the data coming in later force a drastic downward revision, you will normally be grateful. If she instead gives you a best-case estimate and things later turn out much worse, you are likely to blame the modeler: “if only you’d warned me, I’d have pushed for much harder measures”…

That said, some of the “models” now being referred to aren’t really models in the usual sense at all, but rather nonlinear regression fits to actual data, with uncertainty bands provided. I’m sure that whatever function the IHME people use for fitting COVID19 statistics in various countries is a bit more sophisticated than sigmoid functions, but the “total deaths” graphs look quite similar to a sigmoid to this mad scientist’s eye.

The nice thing about such “phenomenological models” [*] is that they are trivially adjusted to new data as they come in: add the data point, refit, get your new uncertainty band, and presto!

In this morning’s DIE WELT, I read an interview with a mathematics professor named Moritz Kaßmann at the University of Bielefeld, who got interested in this subject early on when one of his students returned from Wuhan and gave him the heads-up “this [expletive] is going to hit in Germany as well”.

Anyway, he had a good look at the German COVID19 statistics, and noted that they were surprisingly well fitted by the following (for people in my day job) very simple function:

f(t) = A exp(Bt - Ct^2) = A exp(Bt) exp(-Ct^2)

where A corresponds to the number of cases at t=0, the exp(Bt) term corresponds to the exponential growth phase, and the Gaussian term exp(-C t^2) corresponds to the damping phase, which is stronger as C grows larger and absent if C=0 (since exp(0)=1). In what follows, ln(x) represents the natural logarithm of x.

Now if you take the natural logarithm of the data, ln f(t), the fit simplifies to a quadratic regression:

ln f(t) = ln(A) + B t - C t^2
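For readers who want to try the quadratic regression themselves, here is a minimal Python sketch, assuming NumPy is available. The numbers are synthetic, generated from made-up values of A, B, and C rather than taken from the Johns Hopkins data, and the function name fit_log_quadratic is hypothetical, chosen only for illustration.

```python
import numpy as np

def fit_log_quadratic(t, cases):
    """Fit ln f(t) = ln(A) + B t - C t^2 by ordinary least squares.

    Returns (A, B, C) in the notation of the text.
    """
    # np.polyfit returns coefficients from the highest degree down:
    # c2 multiplies t^2, c1 multiplies t, c0 is the constant term.
    c2, c1, c0 = np.polyfit(t, np.log(cases), 2)
    return np.exp(c0), c1, -c2

# Synthetic example: hypothetical parameters, NOT real COVID19 data.
A_true, B_true, C_true = 100.0, 0.30, 0.004
t = np.arange(0, 40, dtype=float)
cases = A_true * np.exp(B_true * t - C_true * t**2)

A, B, C = fit_log_quadratic(t, cases)
print(f"A = {A:.1f}, B = {B:.3f}, C = {C:.4f}")
```

Because the synthetic data contain no noise, the fit should simply hand back the parameters that generated them. Real case counts will scatter around the curve, and fitting in log space weights relative (not absolute) deviations.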

At low t, this function will show exponential growth, but at longer t, the Gaussian damping will become more prominent, and eventually a turnover will occur. Now let’s apply this to the active cases in Germany, for example (data taken from the Johns Hopkins website):

[Figure: active cases in Germany; data points in blue, regression curve in orange]

Active cases are defined as “diagnosed – cured – deceased”.

Well, if such a simple and “parsimonious” (in terms of only having 3 parameters) model has such a high “coefficient of determination” — R^2 = 0.9977 means that 99.77% of the variance in the data is reproduced by the fitted curve — there has to be something to it. You don’t find such high R^2 values under a horse’s tail, pardon my Dutch.
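As a reminder of what the coefficient of determination measures, here is a small Python sketch that computes R^2 = 1 - SS_res/SS_tot for a quadratic fit. It assumes NumPy, uses synthetic noisy data with made-up parameters, and computes R^2 on the logarithmic scale; whether the value quoted above was computed on the logarithmic or the linear scale is not something this sketch can settle.

```python
import numpy as np

def r_squared(y, y_fit):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y = np.asarray(y, dtype=float)
    y_fit = np.asarray(y_fit, dtype=float)
    ss_res = np.sum((y - y_fit) ** 2)     # variance left unexplained by the fit
    ss_tot = np.sum((y - y.mean()) ** 2)  # total variance around the mean
    return 1.0 - ss_res / ss_tot

# Synthetic illustration: hypothetical parameters plus Gaussian noise,
# NOT real COVID19 data.
rng = np.random.default_rng(0)
t = np.arange(0, 40, dtype=float)
log_clean = np.log(100.0) + 0.30 * t - 0.004 * t**2
log_noisy = log_clean + rng.normal(0.0, 0.05, size=t.size)

coeffs = np.polyfit(t, log_noisy, 2)
r2 = r_squared(log_noisy, np.polyval(coeffs, t))
print(f"R^2 = {r2:.4f}")
```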

Extrapolating the fitted function will get more uncertain as you leave the actual data range, to be sure, but the fit clearly puts us just days from the plateau phase, which it places between April 14 and 19.
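Where the fitted curve turns over follows directly from the parameters: the exponent Bt - Ct^2 is largest where its derivative B - 2Ct vanishes, so the peak sits at t = B/(2C), with height A exp(B^2/(4C)). A minimal Python sketch, using made-up parameter values (not the fitted German ones) and a hypothetical function name:

```python
import math

def peak_of_fit(A, B, C):
    """Return (t_peak, f_peak) for f(t) = A exp(B t - C t^2), with C > 0."""
    if C <= 0:
        raise ValueError("no turnover: the curve only peaks when C > 0")
    t_peak = B / (2.0 * C)                     # where B - 2 C t = 0
    f_peak = A * math.exp(B * B / (4.0 * C))   # f evaluated at t_peak
    return t_peak, f_peak

# Hypothetical parameters for illustration only.
t_peak, f_peak = peak_of_fit(A=100.0, B=0.30, C=0.004)
print(f"peak at t = {t_peak:.1f} days, about {f_peak:.0f} cases")
```

Since t_peak is a ratio of two fitted coefficients, a small uncertainty in C translates into a noticeable uncertainty in the peak date, which is one reason to quote a range rather than a single day.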

Prof. Kaßmann discusses his method (in German) in a video on YouTube.

He notes that a lot of discussion in Germany centers on the “doubling rate” (how much time it takes for the number of cases to double), but that the methods for evaluating the doubling rate are kind of slapdash: it is extracted from the daily growth rate, or from its moving average over 5 days. With a fit function like this, it can be evaluated analytically using just

t2 = ln(2)/(-2 C t + B)

where -2 C t + B is the first derivative of ln f(t) = -C t^2 + B t + ln(A) with respect to t; if you like, the slope of the tangent to the logarithmic curve at point t.

Note that for large enough t, t2 becomes negative as the curve turns over: at that point, -t2 becomes the halving rate. According to “Kaßmann’s slope trick”, the doubling rate for active cases in Germany is already in the once-a-month region.
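The slope trick is easy to play with in a few lines of Python. The sketch below evaluates t2 = ln(2)/(B - 2Ct) at several times; the values of B and C are made up for illustration and are not the fitted German parameters, and the function name is hypothetical.

```python
import math

def doubling_time(t, B, C):
    """t2 = ln(2) / (B - 2 C t); a negative value means the cases are halving."""
    slope = B - 2.0 * C * t  # derivative of ln f(t) at time t
    if slope == 0.0:
        return math.inf      # exactly at the peak: neither doubling nor halving
    return math.log(2.0) / slope

# Hypothetical parameters for illustration only.
B, C = 0.30, 0.004
for t in (0, 20, 35, 45):
    print(f"t = {t:2d}: t2 = {doubling_time(t, B, C):8.1f} days")
```

As t approaches the peak the slope shrinks toward zero and the doubling time stretches out, then flips sign once the curve has turned over.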

I made similar graphs for Belgium and for Israel: I get basically the same shape, 8 days and 4 days shifted right, respectively. The coefficients of determination for these primitive fits are 0.9974 and 0.9976, respectively.

We are not out of the woods yet — this is not the time to get cocky. But the light at the end of the tunnel is becoming visible. And thus, as a sustained shutdown will wreak increasing havoc on the economy, this is the time to get serious, creative, and agile about “back to normal” measures. In particular the food supply chain cannot be left untended. Society can live without rock concerts, soccer games, or discos. It can definitely live with telecommuting for IT professions. But it cannot live without agriculture and food processing, or (G-d forbid) the price we pay in lives might well exceed the toll from the virus.

To my Jewish readers, mo`adim le-simcha. To my Christian readers, have a meaningful Good Friday, and soon a happy Easter.

[*] The term “phenomenological” refers to a fit that aims to reproduce data (the numbers as they are) but does not make any physical, chemical,… model assumptions about the form of the equation.
