Visualizing COVID-19 Statistics

This is an introduction to the visualization and interpretation of COVID-19 (Corona Virus Disease 2019) timeline data. It will show you where to download timeline data, how to visualize them, and how to understand the development of COVID-19 case numbers. This essay will only cover the very basics, and it will limit interpretation to the (more or less) obvious facts. There will be no modelling or extrapolation.

All visualization will be done with the Klong COVID19 package, which has been written specifically for this purpose. If you want to conduct your own experiments, you will need the Klong interpreter, the Ghostscript interpreter, and the GV viewer (or sufficient motivation to hack the Ghostscript invocation in the Klong package).


Getting the Data
World-Wide Development
Exponential Growth
Long-Term Development
Local Developments
More Artifacts
Growth Curves

Getting the Data

I am using the timeline databases compiled by the Center for Systems Science and Engineering of the Johns Hopkins University. The databases in CSV format can be downloaded here:

The databases have to be stored locally under different names. See the Klong COVID19 package for details.

Once the databases have been fetched and renamed, the COVID19 package can be loaded into Klong by running

kg -l covid19

Instructions for plotting the data will be included in the subsequent sections. Most panels can be recreated by typing


where "Location" is any country, region, or province contained in the database.

World-Wide Development

There are three timeline databases reflecting the basic progression of a disease inside of a population: people get sick, most of them recover, and some of them die. The x-axis of the plots shows time (days since the beginning of the data collection) and the y-axis shows case totals. Hence, each bar indicates "cases at a given point of time". The following plot shows the world-wide cumulative case numbers:

These colors will be used to plot numbers in all panels in this text:

yellow: total cases
pink: recovered cases

You can click on any panel to view a full size version of the plot and the Klong/COVID19 instructions that were used to generate the plot.

As you can see, the number of cases world-wide is growing steadily with a dent in the middle of the total (confirmed) case curve. We will discuss this later. The number of people who recovered from COVID-19 also grows steadily. So does the number of deaths, although – fortunately – at a much slower rate.

The difference between the total number of cases and the total number of recovered cases is the number of people who are sick at a given point of time. This corresponds to the visible parts of the yellow bars in the above figure. Let us call these the active cases. Active cases are a very interesting statistic. They are plotted in orange in the subsequent panels. The following panel shows the active cases in place of the recovered cases.

Note that the active cases are not a cumulative statistic. Total cases, recovered cases, and deaths are cumulative numbers, so they can never grow smaller. Active cases can – and eventually will – grow smaller.
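The relationship between these statistics can be sketched in a few lines of Python. This is not the code of the Klong COVID19 package: the numbers are made up for illustration, and active_cases and is_cumulative are hypothetical helpers that simply follow the definitions given above.

```python
# Made-up cumulative series, one value per day (not real data).
total     = [10, 25, 60, 140, 300, 520]
recovered = [ 0,  2,  8,  30,  90, 330]

def active_cases(total, recovered):
    """Active cases per day, using the definition from the text:
    total cases minus recovered cases."""
    return [t - r for t, r in zip(total, recovered)]

def is_cumulative(series):
    """A cumulative series never grows smaller from one day to the next."""
    return all(a <= b for a, b in zip(series, series[1:]))

active = active_cases(total, recovered)
print(active)                  # [10, 23, 52, 110, 210, 190]
print(is_cumulative(total))    # True
print(is_cumulative(active))   # False: the last value is smaller
```

In this toy series the active cases shrink on the last day because more people recovered than fell sick, while the total keeps growing.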

Exponential Growth

"Growth" is the difference between two bars in a bar plot. When the difference remains (more or less) the same, the growth is "linear". When the growth itself grows, it is "polynomial" or "exponential". The difference between polynomial and exponential growth is that exponential growth is always faster when given enough time. The following plot shows some exponentially growing curves.

The solid line may not look like it is growing fast, because its formula has a small base. The dashed and dotted lines have larger bases and therefore grow faster. However, even the solid line will take the shape of the dotted line when given enough time, and because its growth itself grows, this may happen much earlier than expected! Look closely at the other two lines: they both started with a rather flat inclination!
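The kinds of growth described in this section can be compared numerically. The following Python sketch is only an illustration: the three formulas (5*d, d**2, 2**d) and the comparison of 1.1**d with d**3 are arbitrary choices, not the curves from the plot.

```python
def growth(series):
    """Difference between each bar and the one before it."""
    return [b - a for a, b in zip(series, series[1:])]

days = range(8)
linear      = [5 * d for d in days]    # growth stays the same
polynomial  = [d ** 2 for d in days]   # growth grows by a fixed step
exponential = [2 ** d for d in days]   # growth is proportional to the value

print(growth(linear))       # [5, 5, 5, 5, 5, 5, 5]
print(growth(polynomial))   # [1, 3, 5, 7, 9, 11, 13]
print(growth(exponential))  # [1, 2, 4, 8, 16, 32, 64]

# Even an exponential with a small base (1.1) overtakes a polynomial (d**3)
# when given enough time. The search starts at day 2, where the
# polynomial is already ahead.
overtaken = next(d for d in range(2, 1000) if 1.1 ** d > d ** 3)
print(overtaken)
```

For a long time the curve with base 1.1 looks harmless next to d**3, which is exactly why such curves are easy to misjudge.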

The problem with exponential curves is that without some solid mathematical background, you will most certainly misjudge their growth. Here is a well-known thought experiment: There is a water lily on a lake. The number of lilies doubles every day. After 20 days, the entire lake is covered with lilies. On which day is half of the lake covered with lilies? Try to find the answer yourself before looking at the solution!

This is why, when confronted with an exponential growth problem, it is important to act as soon as possible, even if things do not look too bad.
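The water lily experiment can be checked with a few lines of Python. The sketch assumes nothing beyond the puzzle itself (coverage doubles every day and is complete on day 20); covered_fraction is a hypothetical helper name.

```python
FULL_DAY = 20  # day on which the lake is fully covered (from the puzzle)

def covered_fraction(day):
    """Fraction of the lake covered on a given day, given that the
    coverage doubles daily and reaches 1.0 on FULL_DAY."""
    return 2.0 ** (day - FULL_DAY)

for day in (10, 15, 18, 19, 20):
    print(day, f"{covered_fraction(day):.4%}")

# First day on which at least half of the lake is covered.
half_day = next(d for d in range(1, FULL_DAY + 1)
                if covered_fraction(d) >= 0.5)
print(half_day)  # 19
```

Half of the lake is covered only one day before the end, and on day 10 less than 0.1% of it is covered.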

In Italy, measures to contain the spread of COVID-19 were enacted a little bit too late, and now the population of the country is paying a bitter price, as stated in Italy's open letter to the international scientific community.

Here are the statistics for Italy, and an exponential curve fitted to the total case statistic. (Do not use the model for extrapolation; it will be wrong!)
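One common way to fit an exponential curve is a least-squares line through the logarithms of the values. The Python sketch below illustrates that technique on a synthetic series. Whether the COVID19 package fits its curve this way is not stated here, and fit_exponential is a hypothetical helper, so treat this as an assumption rather than a description of the package.

```python
import math

def fit_exponential(values):
    """Fit y = a * b**x to positive values by least squares on log(y).
    x is the day index 0, 1, 2, ...; returns (a, b)."""
    xs = range(len(values))
    ys = [math.log(v) for v in values]
    n = len(values)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), math.exp(slope)

# Synthetic series (not real data): starts at 3 and grows by 25% per day.
cases = [3 * 1.25 ** day for day in range(15)]
a, b = fit_exponential(cases)
print(round(a, 3), round(b, 3))              # 3.0 1.25
print(round(math.log(2) / math.log(b), 1))   # doubling time in days: 3.1
```

The fitted base b is the daily growth factor. Real case numbers will not follow such a curve for long, which is why the fit should not be used for extrapolation.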

Long-Term Development

In every finite environment, which includes us, our environment, our economy, and even the SARS-CoV-2 virus, exponential growth has a natural limit. This means that the exponential expansion of the disease cannot continue forever. At some point, growth must slow down and, finally, come to a standstill.

This must, of course, be the case when there are no more people left to infect, but in reality this will happen much sooner, and with the help of some countermeasures, a whole lot sooner.

So in the long term, the growth of the total cases will become quasi-linear, and then "logarithmic". Logarithmic growth is what happens when the difference between two neighboring bars in a bar plot decreases and finally tends toward zero. The following graph shows an S-shaped sigmoid curve that models this progression.

The dashed line over the orange area is the number of active cases over time and the sigmoid curve itself reflects the total number of cases over time (it is the integral of that curve). The turning point where exponential growth turns into logarithmic growth is at x = 0. This is the point where the active case number starts to decline.
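The shape of such a curve can be explored with a short Python sketch. It assumes the logistic function as the sigmoid, which is one common choice and not necessarily the formula used for the graph. The names sigmoid and sigmoid_slope are hypothetical.

```python
import math

def sigmoid(x):
    """Logistic curve, rising from 0 to 1 with its turning point at x = 0."""
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_slope(x):
    """Derivative of the sigmoid: the bell-shaped curve under the S."""
    s = sigmoid(x)
    return s * (1.0 - s)

xs = [x / 2 for x in range(-12, 13)]     # -6.0, -5.5, ..., 6.0
peak = max(xs, key=sigmoid_slope)
print(peak)                              # 0.0

# Differences between neighboring points of the S-curve: they grow
# left of zero and shrink right of it.
steps = [sigmoid(b) - sigmoid(a) for a, b in zip(xs, xs[1:])]
print([round(s, 3) for s in steps])
```

The bell-shaped curve peaks exactly where the S-curve changes from accelerating to decelerating growth.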

In the Hubei province in China, the development has already progressed past this point and active cases decline steadily, as can be seen in the next panel.

This will happen everywhere eventually, but there still are a few things we can do and must do, if we want to avoid a large-scale catastrophe.

The most important thing is to slow down the growth of the case numbers. The most critical factor in the spread of the virus is proximity. If people do not come too close to each other, the virus will not spread. Another critical factor is hygiene, that is disinfecting and/or washing hands after contact with people and not touching one's face. Large-scale lockdowns are the most promising countermeasure to slow down the spread of the virus.

This countermeasure may not have a large effect on the total number of people affected, but it will have a large effect on the number of people who are sick at the same time. Have a look at the following graph:

When the infection rate is low (people avoid each other), the curve is long and flat. When the infection rate is high (people keep meeting as usual), the curve is steep and short.

The problem with the steep curve in the example is that at some point 50% of the population will be sick. Even if only one percent of them needed hospital care, this would overwhelm even the best health care system (and the actual number is probably much, much higher). In the flat curve, no more than 12.5% of the population is sick at any point of time. To cope with this epidemic in the best possible way, the curve should be as flat as possible.
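The effect of the infection rate on the height of the curve can be reproduced with a toy model. The Python sketch below uses a basic SIR (susceptible, infected, recovered) model, which is an assumption made for this illustration and not necessarily how the graph above was produced. The rates 0.5, 0.2 and 0.1 are arbitrary, and peak_sick_fraction is a hypothetical helper.

```python
def peak_sick_fraction(infection_rate, recovery_rate=0.1, days=1000, dt=0.1):
    """Integrate a basic SIR model with Euler steps and return the largest
    fraction of the population that is sick at the same time."""
    s, i, r = 0.999, 0.001, 0.0      # susceptible, infected, recovered
    peak = i
    for _ in range(int(days / dt)):
        new_infections = infection_rate * s * i * dt
        new_recoveries = recovery_rate * i * dt
        s -= new_infections
        i += new_infections - new_recoveries
        r += new_recoveries
        peak = max(peak, i)
    return peak

steep = peak_sick_fraction(infection_rate=0.5)   # people keep meeting as usual
flat  = peak_sick_fraction(infection_rate=0.2)   # people avoid each other
print(f"steep curve peak: {steep:.1%}")
print(f"flat curve peak:  {flat:.1%}")
```

With these made-up rates the peak of the flat curve is much lower than the peak of the steep one. The exact percentages depend entirely on the chosen parameters and say nothing about COVID-19 itself.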

Local Developments

At the time of writing this text (2020-Mar-16), the development looks pretty much the same everywhere in the world, except for Hubei province and a few Asian countries, where the development has progressed further and/or lockdown/quarantine was implemented early. Here are, for example, the total, active, and death statistics for France, Germany, and the USA:

Except for an artifact (the point where case numbers suddenly jump) in the statistics for the USA, the curves look almost identical. The artifact may be there because of a policy change regarding the large-scale administration of tests or due to delayed reporting.

Because the development is at its beginning, the total case curve is very steep and there are almost as many active cases as total cases. This is because few people had the chance to recover during such a short period of time. Compare this to Italy, where some people already start to recover. This shows as yellow tips on the otherwise orange bars.

In locations where the disease has spread for a longer time, the total case curve will start to separate visibly from the active case curve as more and more people recover. This is what can be observed, for example, in Japan:

The curve is clearly exponential, but with a rather small base (infection rate). Given the time since the virus started to spread in Japan, this indicates pretty effective countermeasures.

The curve of South Korea already seems to have reached the logarithmic phase. Whatever they are doing, it is a good idea! (It could be an artifact, but given the sample size and the smoothness of the curve I think it is an actual development!)

Many Asian countries will show a development that reflects the world development, most probably due to their proximity to China. Active cases dropped when the spread was contained in Hubei, but then started to increase again as local transmission outside of China became more common. This can be seen, for instance, in Thailand:

When the spread of the disease is being contained or slowed down, active case numbers will decrease or at least stagnate. This can be observed, for instance, in the statistic for Hong Kong:

While there is still growth, the active case numbers remain more or less constant. However, they do not decline either, so it is hard to make a prediction about the further development. The ragged nature of the active case curve is due to the small sample size.

Small sample sizes (case numbers) often lead to curves that do not seem to be clearly exponential, as the following example (Iceland) illustrates:

Growth is not really logarithmic at the beginning of this curve. Growth is always irregular (it varies slightly from day to day) and only converges toward a clean curve over time or when large samples are given. Iceland is a small country, so the variance in growth (the "raggedness" of the plot) can dominate the exponential nature of the curve.

More Artifacts

An artifact, in science, is a product of "artificial" origin, i.e. something that was added by some source that is external to the subject under investigation. In statistics, this is most of the time the result of erroneous measurement or small sample sizes.

If you see a plot like the one below (the COVID-19 timeline of Nigeria), you might as well discard the entire data set. The staircase-like outline is a dead give-away for a sample size that is way too small, and the numbers are so low that they certainly do not reflect the actual progression in that area.

However, there are more subtle artifacts, and we have already encountered a few in the previous sections. For instance, the statistics for Italy contain the same value twice at days 49 and 50:

The same phenomenon can be observed on the graph of France and, curiously, on the same days. Such duplication typically happens when no values are being reported for one day. Such an artifact is usually followed by a jump in the graph, where the cases for two days are reported simultaneously.
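This kind of artifact (a repeated value followed by a jump) can be searched for mechanically. The following Python sketch is one possible heuristic, not a feature of the COVID19 package. find_reporting_gaps, its jump_factor threshold, and the sample series are all made up for illustration.

```python
def find_reporting_gaps(cumulative, jump_factor=1.5):
    """Return the day indexes at which a cumulative series repeats the
    previous day's value and the following day's increase is at least
    jump_factor times the increase before the repeat."""
    gaps = []
    for day in range(2, len(cumulative) - 1):
        before = cumulative[day - 1] - cumulative[day - 2]
        repeat = cumulative[day] == cumulative[day - 1]
        after = cumulative[day + 1] - cumulative[day]
        if repeat and before > 0 and after >= jump_factor * before:
            gaps.append(day)
    return gaps

# Made-up series: day 4 repeats day 3, day 5 reports two days at once.
series = [100, 130, 170, 220, 220, 350, 430]
print(find_reporting_gaps(series))   # [4]
```

A hit only marks a day worth looking at. With small case numbers, a day without new cases is perfectly plausible and not an artifact at all.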

A similar artifact can be found in the graph of Hubei on day 21 and, because most infections took place in Hubei at that time, the artifact is even visible in the world statistics:

This artifact looks weird, because the growth spike after the duplicated value seems too big. It is also too big to be explained by variance. Maybe there was some policy change that caused additional cases to appear, like additional testing for the SARS-CoV-2 virus taking place. If you have any information about this event, please let me know! The date of the jump would be 2020-Feb-12.

A jump in a graph with a big sample size (≥1000) should always make you look twice and wonder why it appears. Another dramatic jump appears on day 48 in the confirmed case graph of the USA. It is most certainly caused by more testing rather than actual progression:

Growth Curves

A "growth curve" does not display the number of cases, but the growth of cases at a given point of time. That is, it displays the amount by which the corresponding statistic grew since the last point of measurement. The following graph displays the cumulative cases world-wide as well as the corresponding growth curve in green.

Growth curves can be useful for visualizing growth in exponential developments. The following is the growth curve for total and active cases world-wide. It is basically the same as the growth curve in the above panel, only stretched across the y-axis to make the growth more distinct.
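Computing a growth curve from a timeline is a matter of taking day-to-day differences. The Python sketch below uses made-up numbers and a hypothetical helper named growth_curve. It illustrates the definition given above and is not the implementation used by the COVID19 package.

```python
def growth_curve(series):
    """Amount by which a series grew since the previous measurement.
    The first day has no predecessor, so the result is one item shorter."""
    return [today - yesterday
            for yesterday, today in zip(series, series[1:])]

# Made-up cumulative totals and active cases (not real data).
total  = [100, 150, 230, 350, 500, 640, 740]
active = [ 95, 140, 205, 290, 360, 380, 350]

print(growth_curve(total))    # [50, 80, 120, 150, 140, 100]
print(growth_curve(active))   # [45, 65, 85, 70, 20, -30]
```

The growth of a cumulative statistic is never negative, while the growth of the active cases becomes negative as soon as the active cases decline.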

The left half of the curve shows a roughly bell-shaped curve, where active case numbers decline toward the middle of the graph. This is because the Hubei province dominated the world curve up to this point.

Then there is a stretch from the middle of the graph to 3/4 of it, where case numbers grow, but active cases stagnate. This is because recovering cases in Hubei compensate for new cases in the rest of the world. Finally, recovery in Hubei (which is logarithmic) is dominated by the exponential growth world-wide, and growth turns exponential again.

Eventually, the curve will turn into a camel curve, a curve with two "bumps" or peaks. The right bump will be larger than the left one, though, because the world population is larger than the population of Hubei.

Note that growth curves are always ragged, because growth rates typically vary from day to day. There are some suspicious spikes in the graph, though, especially the one in the declining half of the Hubei curve.

Here are some more growth curves. The first one is Hubei, without the rest of the world. Initially, recovered case growth starts to dominate total case growth, so the active case growth goes toward zero (and in fact becomes negative, but the graph does not show that). Then total case growth also stagnates and eventually goes toward zero.

Nothing peculiar is going on in Germany, except for one lonely spike that is probably due to a delay in reporting.

A gap in the graph of Italy is probably also due to delayed reporting. Note that, unlike in Germany, total cases and active cases start to separate.

Sweden's graph is quite ragged due to its small sample size. The decline on the right side may be variance or it may be an actual development. It is hard to say given the small sample.
