Jan 14, 2021

Interpreting Spatial Data in the Age of COVID-19

Now that 2020 has come to an end, many are eager to leave the mess of COVID-19 behind and get a fresh start with the new year. Unfortunately, new cases are still soaring across the United States and, even with vaccines, pre-COVID life is likely to remain elusive through much of 2021. Luckily, there are many tools available to track the spread of COVID-19 while we wait for normalcy. Health organizations, major news outlets, and even search engines have COVID-19 dashboards that are updated daily. Despite displaying the same data, each dashboard uses its own blend of data aggregation and visualization techniques, some of which are easier to digest than others. This post reviews spatial visualization methods and gives recommendations on how to interpret spatial data to get the most out of these dashboards.

1) Bubble Charts vs. Choropleths
A quick search of “covid cases” on any major search engine will return multiple dashboards that rely heavily on two map visualizations – bubble charts and choropleths. Bubble charts are just what they sound like: maps that display a circle or “bubble” centered over each geographic region and sized according to that region’s statistic. Choropleths are fundamentally different: they color each region according to a range of numeric values shown in a legend on the side. Choropleths are usually criticized for their geographic bias – it’s much easier to read the number of COVID cases in larger states like California than in smaller ones like Rhode Island. When interpreting choropleths, it’s important to remember that the eye is naturally drawn to larger regions and that smaller regions deserve equal scrutiny. Bubble charts get around this issue by placing the focus on the size of the bubble, not the underlying region with which the bubble is associated. However, a quick search of “covid cases” on Google will reveal bubble charts’ major flaw – without altering the scale of the data to fit the visualization (a process called normalization), the bubbles can easily overlap and make the visualization largely unreadable. Most COVID-19 dashboards provide raw data without normalization, which often leads to a mess of overlapping bubbles like the picture below. Because of this issue, favor dashboards built on more readable choropleths over those that rely heavily on bubble charts.

Bubble chart from Google’s COVID-19 dashboard.
Choropleth map from searching “covid cases” on Yahoo.com
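
One way to picture the normalization mentioned above is to rescale every count against the largest one, so that no bubble can exceed a chosen maximum size. The sketch below is only an illustration under assumptions: the region names, the counts, and the max_radius parameter are made up, and it is not how any particular dashboard draws its bubbles. It scales the bubble's area (not its radius) with the count, which is a common convention for bubble maps.

```python
import math


def bubble_radii(counts, max_radius=30.0):
    """Return a radius per region so that bubble AREA is proportional to the count.

    The largest count gets radius `max_radius` (a hypothetical display limit);
    every other region is scaled relative to it.
    """
    largest = max(counts.values())
    if largest <= 0:
        # Nothing to draw if every region has zero cases.
        return {region: 0.0 for region in counts}
    return {
        region: max_radius * math.sqrt(value / largest)
        for region, value in counts.items()
    }


# Hypothetical cumulative case counts, for illustration only.
example_counts = {"Region A": 400_000, "Region B": 100_000, "Region C": 25_000}
radii = bubble_radii(example_counts)

for region, radius in radii.items():
    print(f"{region}: radius {radius:.1f}")
```

Because the radius grows with the square root of the count, a region with four times the cases gets a bubble with twice the radius, which keeps the largest bubbles from swallowing their neighbors.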

2) Geographic boundaries are often arbitrary
As hinted above, the underlying geography represented in a spatial visualization can affect how the data should be interpreted. Spatial data is often tied to arbitrary geographic boundaries. When considering cumulative COVID-19 cases per region as seen in the maps above, it would also be useful to account for the population size and density of each region. However, states vary in both geographic size and population. It would make more sense to create new boundaries for the map so that each region contains an equal population, but since US COVID-19 data is broken up by counties and states, this is nearly impossible. Instead, the data can be displayed as a rate of COVID cases per 100,000 population. The Johns Hopkins dashboard of the United States has several maps that display cases per population at the county level. Just remember that not every county is the same size geographically, which can lead to a bias toward larger counties during interpretation.
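
The rate of cases per 100,000 population is simple arithmetic, and a small sketch shows why it changes the picture. The region names and numbers below are hypothetical and chosen only to illustrate the calculation.

```python
def cases_per_100k(cases, population):
    """Convert a raw case count into a rate per 100,000 residents."""
    if population <= 0:
        raise ValueError("population must be positive")
    return cases * 100_000 / population


# Hypothetical (cases, population) pairs, for illustration only.
regions = {
    "Large Region": (500_000, 10_000_000),
    "Small Region": (75_000, 1_000_000),
}

rates = {
    name: cases_per_100k(cases, population)
    for name, (cases, population) in regions.items()
}

for name, rate in rates.items():
    print(f"{name}: {rate:.0f} cases per 100,000")
```

In this made-up example the larger region has far more raw cases, but the smaller region has the higher rate, so the two maps would rank them in the opposite order.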

3) Temporal aggregations are important
When looking through dashboard views, it’s important to keep in mind how the statistics are aggregated over time. Most dashboards default to a view of cumulative COVID-19 cases by region, but also provide statistics aggregated over the last day, week, or two weeks to show recent trends. However, dashboards typically don’t use the same aggregations, and that can make a huge difference in the display. For example, the Johns Hopkins dashboard has a map of new cases by population for the last day, the CDC provides a map of new cases by population over the last seven days, and Google’s dashboard provides an aggregation for the last two weeks. Depending on the trends in a specific location, these can produce drastically different maps. Counts of COVID-19 cases within the last day can be biased by collection and reporting procedures – not all counties report data as quickly as others. This can distort the map and cause drastic day-to-day changes. Aggregating counts over the past 7 or 14 days accounts for this issue, but each window has its own nuances for interpretation. A surge in cases in one week will show up more slowly in the two-week display, but a one-week surge doesn’t necessarily indicate the emergence of a long-term trend. None of these aggregations is wrong, and each has its own merits, so keep in mind how the data is aggregated when making interpretations.

New cases by population over the past day from the Johns Hopkins dashboard.
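
To see how the choice of time window changes the numbers, here is a minimal sketch of a trailing sum over daily counts. The daily figures are invented for illustration: a steady 10 cases per day, with one day reported as 0 and the missing cases reported the next day, which is the kind of reporting delay described above.

```python
def trailing_sum(daily_counts, window):
    """For each day, sum that day and the preceding days within `window` days.

    Early days use however many days are available so far.
    """
    totals = []
    for i in range(len(daily_counts)):
        start = max(0, i - window + 1)
        totals.append(sum(daily_counts[start:i + 1]))
    return totals


# Hypothetical daily new cases: day 8 is reported late and lands on day 9.
daily = [10, 10, 10, 10, 10, 10, 10, 0, 20, 10]

one_day = trailing_sum(daily, 1)
seven_day = trailing_sum(daily, 7)

print("1-day view:", one_day)
print("7-day view:", seven_day)
```

In the one-day view the delayed report looks like a collapse followed by a doubling, while the seven-day totals barely move once the late cases arrive. The trade-off is that a genuine surge also takes longer to stand out in the wider window.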

4) Keep your original question(s) in mind
It is easy to get lost in the multitude of visualizations that make up most COVID-19 dashboards and forget why you went to the page in the first place. Are you interested in the risk of getting COVID-19 from going into the office, or from visiting friends or family? Or perhaps you want to view nationwide trends in cases over the past few months? Keeping the original question in mind can help narrow down the search for the visualization that will help you the most. To understand your local risk, a cumulative case map aggregated to the state level will not be nearly as helpful as one aggregated to the county level over the past two weeks. For nationwide trends, many dashboards include state-level histograms of cases over time, which are more useful than a map of cumulative cases – just remember these visualizations are still affected by geographic bias. Whatever your question, keep it in mind so you can filter out the visualizations that are not useful to you.

5) Ask what a visualization does not provide
A useful tip for finding the right visualization is to ask what information it does not provide. A map of cumulative cases does not indicate the date of a region’s first recorded case, leaving out information on the intensity of the pandemic by location. For example, COVID-19 will have a very different impact on two states with equal current case counts if one recorded cases at a steady pace over a long period while the other had a sharp increase over the past few months. Asking what a visualization doesn’t provide can put its limitations in context and help make sure you are getting the information you are most interested in.

Hopefully you will find these recommendations useful as we wait for this pandemic to end. Stay safe and healthy!

About the Author

Michael Wallace profile.