Artificial intelligence
Seven lessons to deal with pandemic data
22 March 2021

This is an English summary of the article by Ricardo Baeza-Yates and Karma Peiró published in AI4EU. The original is available in Catalan and Spanish.


In early 2020, almost no one had heard of the coronavirus or imagined how the disease could change the world in a matter of weeks. Five months later, the world economy has collapsed, telework has prevailed as the way to keep working, and we interpret data on newly infected, deceased or recovered people on a daily basis.

Data have become a new thermometer for understanding the severity of COVID-19, for abiding by stricter confinement measures, and for assessing the strategies proposed by the politicians who govern us.

However, if this crisis has shown anything, it is how difficult it is to apply the same criteria to official data. For instance, when authorities report a coronavirus death, are they referring to a person who died of COVID-19 or to a terminally ill patient who died during the pandemic? Are the confirmed cases all the cases that exist? Are the “recovered” really recovered?

The pandemic has overwhelmed governments all over the world. The chaotic data they provide fuel uncertainty, because public actions are not always the right ones.

In this article, the authors review seven lessons that the coronavirus crisis has taught so far. The first four are specific to the pandemic; the last three apply to the data of any global crisis, because they stem from human factors.

LESSON 1: Errors in data collection

Without data, it is impossible to understand how the pandemic is progressing, and it is just as impossible without knowing how the data were obtained.

The authors present several examples showing that the information governments and health authorities use to make decisions is neither complete nor standardized. In addition, the information that reaches society is misleading and creates confusion.

  • About the numbers: some countries report the number of tests and others the number of tested individuals, who may have been tested several times (source: Our World in Data).
  • About the quality of the tests: the WHO has recommended Polymerase Chain Reaction (PCR) tests, which are 95% accurate. However, some countries, such as Venezuela, are using cheaper and faster tests of lower quality, which makes the results even less precise.
  • About private laboratories: these centers do not always report to the authorities, which hinders the counting of new cases and the collection of data.
  • About the criteria: depending on the country and the health institution, individuals have been tested with or without symptoms, and with reactive approaches (when people go to the hospital) or proactive ones (random testing in high-density locations), etc.
  • About data collection and processing: errors can occur when information is passed from its original source to its final destination, where it is used to decide health policies.
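The first point in the list above, tests versus tested individuals, can be illustrated with a minimal sketch. The test log, the identifiers and the variable names are hypothetical and only show how the same records yield two different positive rates depending on what is counted.

```python
# Hypothetical test log: (person_id, tested_positive). Values are illustrative only.
test_log = [
    ("p1", False),
    ("p1", True),   # same person, tested again
    ("p2", False),
    ("p3", False),
    ("p3", False),  # same person, tested again
]

# Counting tests: every row counts.
tests_performed = len(test_log)
positive_tests = sum(1 for _, positive in test_log if positive)

# Counting individuals: each person counts once, however often they were tested.
people_tested = len({person for person, _ in test_log})
people_positive = len({person for person, positive in test_log if positive})

rate_per_test = positive_tests / tests_performed
rate_per_person = people_positive / people_tested

print(f"Tests performed: {tests_performed}, people tested: {people_tested}")
print(f"Positive rate per test:   {rate_per_test:.0%}")
print(f"Positive rate per person: {rate_per_person:.0%}")
```

With these made-up records, the rate per test and the rate per person differ noticeably, so two countries publishing "the positive rate" may not be publishing the same quantity.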

LESSON 2: The inaccuracy of the data

Any analysis is inaccurate and any conclusion must be considered very carefully.

The virus relies on physical proximity between people to survive. For this reason, the pandemic is a very dynamic process that depends on many factors, and it is enough for each patient to infect more than one person for the infection to grow exponentially. The speed at which events are unfolding and the lack of knowledge about the disease make the data even less accurate.
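The threshold described above can be illustrated with a minimal sketch. It assumes a deliberately simplified model in which every patient infects the same number of people, `r`, in each generation of contagion; the function name and the sample values are hypothetical and do not come from the article.

```python
def new_cases_per_generation(initial_cases, r, generations):
    """Return the new cases in each generation when every patient infects r others.

    Simplifying assumption: r is constant and identical for all patients.
    """
    cases = [float(initial_cases)]
    for _ in range(generations):
        cases.append(cases[-1] * r)
    return cases


# Hypothetical values: 100 initial cases followed over 5 generations.
shrinking = new_cases_per_generation(100, 0.5, 5)  # each patient infects fewer than one person
growing = new_cases_per_generation(100, 2, 5)      # each patient infects more than one person

print("r = 0.5:", shrinking)
print("r = 2:  ", growing)
```

When `r` is below one the series shrinks toward zero, and when it is above one the series multiplies at every step, which is the exponential growth the paragraph refers to.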

  • Positive cases detected and reported are far from the real number. It is estimated that around 35% of the population is asymptomatic, which means that they are not aware of their infection and contribute to spreading the contagion. Numbers are also strongly related to the quantity of tests in each country: most countries could have at least twice as many infected people as confirmed cases (in Chile the factor is estimated at around 5, in Catalonia up to 10).
  • Following up on recovered persons is difficult, especially for those who never got tested. This is reflected in the data and can create confusion about the real situation. For example, in the UK the data show few recovered individuals, while Chile uses formulas that overestimate the numbers, which raises ethical problems.
  • In the case of deceased people, the numbers are underestimated, and the formulas used to calculate the lethality of the virus are wrong in most cases.
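A minimal sketch shows how strongly a lethality estimate depends on the denominator discussed in the list above. The case and death counts are hypothetical; the multipliers 2, 5 and 10 are the underreporting factors quoted in the text, and the function names are assumptions made for illustration. The sketch corrects only the number of infections, although the article notes that deaths are underestimated as well.

```python
def estimated_infections(confirmed_cases, underreporting_factor):
    """Scale confirmed cases by an assumed underreporting factor."""
    return confirmed_cases * underreporting_factor


def fatality_ratio(deaths, infections):
    """Deaths divided by infections; the result depends entirely on both inputs."""
    return deaths / infections


# Hypothetical figures, not taken from any real country.
confirmed = 10_000
deaths = 300

naive = fatality_ratio(deaths, confirmed)
print(f"Using confirmed cases only: {naive:.2%}")

adjusted = {}
for factor in (2, 5, 10):
    infections = estimated_infections(confirmed, factor)
    adjusted[factor] = fatality_ratio(deaths, infections)
    print(f"Assuming {factor}x more infections: {adjusted[factor]:.2%}")
```

The same number of deaths yields very different lethality figures depending on the assumed factor, which is one reason published values are hard to trust or compare.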

LESSON 3: Chaos in counting the dead

What is the cause of death if a person had a previous illness and dies from COVID-19?

The authors highlight the lack of global standardization, both in how countries decide the cause of death during the pandemic and in the sources they use to count the dead. Each country has followed a different strategy, and in most cases comorbidities are not taken into account.

Not all countries are testing deceased persons. In the US there is an economic incentive to declare deaths as caused by COVID-19, since hospitals receive additional funds from Medicare in these cases. In Belgium, every deceased person who was identified as a possible positive case is counted as a COVID-19 death, which makes it one of the countries with the most deaths per capita.

Most countries are only taking into account deaths reported in hospitals and health centers, but not those that occur in elderly care centers or at home. In Spain, neither mortuaries nor private care centers are required to provide data on such deaths. Depending on the sources a country uses, there can be vast differences in the numbers, which cause statistical discontinuities and contribute to the inaccuracy mentioned in Lesson 2.

LESSON 4: The temporal paradoxes

The pandemic is a very dynamic process that depends on many factors, starting with civic education.

The progress of the pandemic is strongly related to how society behaves in order to stop the exponential spread of contagion. This is why it is so hard to model the evolution of COVID-19. In addition, the data used in models belong to the past. This delay is due to the lag in test results and to the period between the confirmation of a case and a resulting death, which is around two weeks. So for a dead person who was never tested, the approximate date of infection is already in the past.

At the same time, countries are contemplating different possible futures. The later the pandemic reaches a country, the more information that country has about it when making decisions. For example, the prime minister of New Zealand, Jacinda Ardern, declared the lockdown before a first death had even been confirmed, and New Zealand is now one of the few countries that has controlled the contagion.
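The effect of this delay can be sketched with a small example. It assumes the two-week gap between confirmation and death mentioned above and compares today's deaths with the cases confirmed 14 days earlier instead of with today's cases. The series are hypothetical, and the function is one possible adjustment for illustration, not the authors' method.

```python
def lagged_fatality_ratio(cumulative_cases, cumulative_deaths, lag_days=14):
    """Divide the latest cumulative deaths by the cumulative cases lag_days earlier."""
    if len(cumulative_cases) != len(cumulative_deaths):
        raise ValueError("both series must cover the same days")
    if len(cumulative_cases) <= lag_days:
        raise ValueError("series is shorter than the lag")
    return cumulative_deaths[-1] / cumulative_cases[-1 - lag_days]


# Hypothetical series over 22 days: confirmed cases double every week.
cumulative_cases = [100 * 2 ** (day // 7) for day in range(22)]
cumulative_deaths = [0] * 21 + [4]

naive_ratio = cumulative_deaths[-1] / cumulative_cases[-1]
lagged_ratio = lagged_fatality_ratio(cumulative_cases, cumulative_deaths)

print(f"Deaths over today's cases:         {naive_ratio:.2%}")
print(f"Deaths over cases two weeks ago:   {lagged_ratio:.2%}")
```

While cases are growing quickly, dividing by today's cases understates the ratio, because most of today's confirmed patients have not yet had time to recover or die.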

LESSON 5: The importance of transparency

The transparency of the data is a reflection of the level of democracy of each government and the trust of the citizens.

In crises, particularly health crises, the transparency of data is a reflection of the level of democracy of each government and the confidence of citizens. Hiding data only creates mistrust and political problems.

In general, the most democratic countries share microdata (duly anonymized data at the patient level) with their citizens. This is the case in New Zealand, which provides daily updates on confirmed, and even potential, cases. Other countries, such as Nicaragua or Guatemala, are not providing any kind of data, not even aggregated figures.

LESSON 6: Privacy in times of pandemic

Is data privacy the price we must pay to survive a pandemic?

This is the question that several experts asked themselves in a published collaborative document. One of the study's conclusions rejects this dichotomy and argues that both objectives can be achieved. It also indicates that a lack of either privacy or transparency from governments in the use of personal data diminishes public confidence in the state.

This is particularly important when using microdata, since it requires removing any characteristic that could identify an individual within a group smaller than, say, 50 people (a property known as k-anonymity). The data need to be processed so that they use age ranges or geographic districts. If a geographic area has fewer than 50 cases, it must be merged with another area. The group size k can vary depending on the desired level of reidentification risk.

Privacy has also been widely discussed in relation to the mobile apps that different governments have presented to locate possible contagion areas during deconfinement. There have been different solutions, although most European countries are currently working on decentralized, privacy-preserving apps that use Bluetooth connections.
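The two generalization steps described above can be sketched as follows. This is a simplified illustration, not a complete k-anonymity implementation: the function names, the ten-year age bands, the district names and the case counts are all hypothetical, and small areas are pooled into one combined area rather than attached to a specific neighbor.

```python
from collections import Counter

K = 50  # group size used as an example in the text; adjust to the accepted risk


def age_range(age, width=10):
    """Replace an exact age with a range such as '30-39'."""
    low = (age // width) * width
    return f"{low}-{low + width - 1}"


def merge_small_areas(cases_by_area, k=K, merged_label="OTHER"):
    """Pool every area with fewer than k cases into a single combined area."""
    merged = Counter()
    for area, count in cases_by_area.items():
        merged[area if count >= k else merged_label] += count
    return dict(merged)


# Hypothetical districts and case counts.
cases = {"District A": 120, "District B": 30, "District C": 25}
published = merge_small_areas(cases)

print(age_range(37))
print(published)
```

The combined area can itself end up with fewer than `k` cases, so a real release process would need to check the result again and merge further, or withhold that group.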

LESSON 7: The obsession to compare with others

If the criteria differ from country to country, it is very difficult to compare them, even if they measure the same thing.

Does it make any sense to compare two sets of numbers if the other factors surrounding the propagation of the disease are different? This is what governments and media have done so far, but these comparisons are not fair and they distract public opinion by providing biased information, either by design or through ignorance.

As seen in the previous lessons, the strategies followed to collect data and to count confirmed cases and deaths have been very different in each country, without standard methods or normalized samples. Other factors add to the difficulty, such as the population density of each area and how it affects the spread of contagion.
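A minimal sketch shows one small part of this problem: raw totals and population-normalized figures can rank countries differently. The two countries and their numbers are hypothetical, and the function name is an assumption made for illustration.

```python
def per_100k(count, population):
    """Express a raw count per 100,000 inhabitants."""
    return count * 100_000 / population


# Hypothetical countries; neither the deaths nor the populations are real figures.
countries = {
    "Country A": {"deaths": 5_000, "population": 50_000_000},
    "Country B": {"deaths": 1_000, "population": 5_000_000},
}

rates = {
    name: per_100k(data["deaths"], data["population"])
    for name, data in countries.items()
}

for name, rate in rates.items():
    print(f"{name}: {countries[name]['deaths']} deaths, {rate:.1f} per 100,000 inhabitants")
```

Country A reports five times as many deaths but half the rate per inhabitant. Normalizing by population corrects only for size; it does nothing about differing counting criteria, testing policies or population density, which is why such comparisons remain fragile.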

In future pandemics

The authors conclude the article by arguing that data are a priority for moving forward and solving problems in times of crisis. They claim that the WHO should have provided trustworthy protocols for data collection and analysis, and they expect it to learn from this experience for future pandemics. This process would also require each government to designate a person responsible for data in order to improve communication with society.

From this pandemic we have learned that data collection depends on testing policies and that the quality of the data is ensured not by its precision but by its veracity. The global dimension of the pandemic has hindered the interpretation of the data. Overall, it is essential to guarantee data privacy and transparency.

The authors also raise some ethical questions prompted by this pandemic. Is social benefit more important than individual benefit? How many people can we affect if we don’t make a collective effort? In a socio-economic crisis like the one we are experiencing, does the government have the right to bypass acquired rights with the promise of the common good? Until when?

These lessons should help authorities to prepare emergency plans for new crises, to treat data transparency as a priority right of citizens, and to guarantee access to the Internet as a new human right, so that the digital gap does not add more economic inequality to the existing ones.


Ricardo Baeza-Yates has a Ph.D. in Computer Science and is the Director of Data Science at Northeastern University, in Silicon Valley. He is also a part-time researcher at universities in Chile and Catalonia. He is an ACM and IEEE Fellow.

Karma Peiró is a journalist who has specialized in Information and Technology Communication since 1995. Her interests are related to the ethics of Artificial Intelligence and algorithmic transparency.
