Academic Research on the COVID-19 Pandemic Will Be Flawed Unless We Fix Data Collection Issues
Here at Incite Analytics, we've been howling about flaws in U.S. state and national datasets for the past two months. My greatest fear during the pandemic is not that it will worsen and continue (though, of course, it likely will continue apace in the coming months). As tragic as more death is and will continue to be, what I fear most is that all of the pain, death, loss of civil liberties, and economic hardship suffered at home and abroad will be for naught.
In order to honor the sacrifices and loss that we have suffered, we must position ourselves to learn from the pandemic. Pandemics on this scale will happen again; our challenge will be turning the scars of our current circumstances into wisdom that we might do better in the future.
The crux of the issue lies in the testing, case, and mortality data that describe the shape and scale of the pandemic over time. Almost all of the reporting on the pandemic relies upon reporting "daily tests/cases/deaths observed" on a given date.
So what's the problem?
Reporting "daily observed" figures tells us when we learned about a given set of tests, cases, or deaths, but not when they happened. The difference is crucial, because tests, cases, and deaths each carry a different reporting lag (how long it takes for an event to reach public health authorities), and that lag varies by when and where the event is reported. For instance, on 6/24/20, the Arizona Department of Health Services reported that it observed 78 new deaths. This was then misreported everywhere as though 78 people died on June 24th. What actually happened? Only 16 of those deaths occurred in the week between 6/17/20 and 6/23/20. Nearly 80% of those deaths were over a week old. The earliest in that series occurred on 4/11/20. Further, AZDHS reports that the highest death count on any given day (by occurrence, not reporting) is 33 deaths on 6/15/20.
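The "nearly 80%" figure follows directly from the two AZDHS numbers quoted above. A quick check of the arithmetic, with variable names chosen here purely for illustration:

```python
# Figures quoted above from the AZDHS report of 6/24/20.
reported_on_day = 78         # deaths newly reported on 6/24/20
occurred_in_prior_week = 16  # of those, deaths that occurred 6/17/20 through 6/23/20

# Everything else in that day's report is more than a week old.
older_than_week = reported_on_day - occurred_in_prior_week
share_older = older_than_week / reported_on_day

print(f"{older_than_week} of {reported_on_day} deaths ({share_older:.1%}) were over a week old")
# prints "62 of 78 deaths (79.5%) were over a week old"
```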
This method of reporting "date of observation" rather than "date of occurrence" will correctly identify the total cumulative number of tests, cases, and deaths. However, it will unarguably distort the pattern of when events actually happened. In my prior analysis of the COVID Tracking Project's figures, I found that their reporting overstated June deaths in Arizona by as much as 60%, while significantly understating testing. We will be unable to accurately assess the impacts of lock-downs, mandatory mask ordinances, and other measures to slow the spread of the pandemic, because the time-series through the entire pandemic will be garbled.
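To make the distortion concrete, here is a minimal sketch that tallies the same set of deaths two ways. The records are invented for illustration (they are not AZDHS or COVID Tracking Project data), and the record layout is an assumption about what a line list might look like.

```python
from collections import Counter
from datetime import date

# Hypothetical line list: one (date_of_death, date_reported) pair per death.
# These values are invented for illustration only.
records = [
    (date(2020, 6, 1), date(2020, 6, 3)),
    (date(2020, 6, 1), date(2020, 6, 10)),
    (date(2020, 6, 2), date(2020, 6, 10)),
    (date(2020, 6, 2), date(2020, 6, 10)),
    (date(2020, 6, 3), date(2020, 6, 4)),
    (date(2020, 6, 9), date(2020, 6, 10)),
]

# Tally the same deaths by when they happened and by when we heard about them.
by_occurrence = Counter(died for died, _ in records)
by_report = Counter(reported for _, reported in records)

# Both views agree on the cumulative total...
assert sum(by_occurrence.values()) == sum(by_report.values()) == len(records)

# ...but they disagree on the shape of the curve.
for label, series in (("by occurrence", by_occurrence), ("by report", by_report)):
    print(label)
    for day in sorted(series):
        print(f"  {day.isoformat()}: {series[day]}")
```

In this toy example the report-date series shows a spike of four deaths on June 10th, while the occurrence-date series never exceeds two deaths on any day.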
This issue is perhaps most clearly encapsulated by recent changes to death reporting in many states. As of 6/25/20, many states are now including "probable deaths" in their reporting figures. However, rather than attributing those deaths to the dates they actually occurred, media organizations and data aggregation services are merely lumping the data into one day. The screenshot below comes from the New York Times' reporting on the state of New Jersey.
See the glaring, yellow flaw? The Times is reporting over triple the next highest death count, from the worst days of the outbreak, without bothering to properly back-fill or attribute those 1,877 deaths. Incidentally, this is more COVID-19 deaths than the entirety of Arizona's pandemic, which currently stands at 1,588 total deaths, though you'd never know it given how our state has been discussed recently. Those 1,877 deaths occurred between March and June, but are all recorded as though they happened on 6/25/20. This is how their data aggregation appears to treat every day of collection.
So how do we fix this?
It's going to be challenging and will require a lot more work on the part of those claiming to be our sense-making apparatus. Data is collected by date of reporting, rather than by date of occurrence, because it's a lot easier. All you need to do is subtract yesterday's total of reported deaths from today's reported total, and that gives you today's figures. Nothing else needs to change or be updated. This also has the benefit of giving you a fresh news story every day that allows you to speak as though you're talking about current events, when in fact you have no idea what the situation has been for the past week, because the information hasn't yet been collected by local agencies.
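The shortcut described above amounts to differencing a running total. A minimal sketch, using a hypothetical helper name and invented cumulative totals:

```python
def daily_from_cumulative(cumulative):
    """Return day-over-day increases from a list of cumulative totals."""
    return [today - yesterday for yesterday, today in zip(cumulative, cumulative[1:])]

# Hypothetical cumulative death totals on four consecutive reporting days.
cumulative_totals = [100, 110, 188, 195]

print(daily_from_cumulative(cumulative_totals))  # prints [10, 78, 7]
```

Note what this method cannot see: the 78 in the middle says only that the running total grew by 78 between two reports, not when any of those deaths occurred.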
If they were to record the figures by when they occurred, they would have to update all of their daily records, for each day over the entire pandemic. It's a lot harder, but is much more accurate: while the current system can only give us reliable cumulative totals, recording COVID-19 data by the date it occurred will not only give us accurate totals, but also preserve the shape of events as they actually happened. If institutions were really serious about informing researchers, they'd archive and timestamp every day's reporting and present the entire set of information for every day of the pandemic.
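One way the archiving idea could work is sketched below, under loud assumptions: the agency publishes a full deaths-by-date-of-occurrence table each day, and someone keeps every day's copy. The snapshot structure, the `revisions` helper, and all counts are hypothetical.

```python
from datetime import date

# Hypothetical archive: for each publication date, the full table of deaths
# by date of occurrence as it stood that day. All counts are invented.
snapshots = {
    date(2020, 6, 23): {
        date(2020, 6, 20): 5,
        date(2020, 6, 21): 3,
        date(2020, 6, 22): 1,
    },
    date(2020, 6, 24): {
        date(2020, 6, 20): 9,
        date(2020, 6, 21): 7,
        date(2020, 6, 22): 4,
        date(2020, 6, 23): 2,
    },
}

def revisions(old, new):
    """Return {occurrence_date: change} for every date whose count changed."""
    changed = {}
    for day in sorted(set(old) | set(new)):
        delta = new.get(day, 0) - old.get(day, 0)
        if delta:
            changed[day] = delta
    return changed

# Compare two consecutive snapshots to see where the new deaths were attributed.
backfill = revisions(snapshots[date(2020, 6, 23)], snapshots[date(2020, 6, 24)])
for day, delta in backfill.items():
    print(f"{day.isoformat()}: {delta:+d}")
print("newly reported:", sum(backfill.values()))
```

Keeping every snapshot preserves both views at once: the sum of the revisions is the familiar "daily reported" number, while the latest snapshot holds the best current estimate of when deaths actually occurred.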
You don't get to call yourself the paper of record and then obviously and incompetently fail at recording the largest social, economic, and public health crisis in our lifetime. The current state of reporting on the pandemic is perhaps best described as "factual, but not truthful." If we are to learn the lessons of this pandemic, we must have the full and complete context of the outbreak, not sloppy record keeping whose sole purpose is to fan the flames of panic rather than to inform the public and policy makers.
Stay safe everyone,