How closed and incomplete data slow the tech community in fighting COVID-19

Demografy is a privacy-focused customer segmentation SaaS that uses AI to predict demographic data from masked names.
Unlike traditional solutions, it does not require businesses to know or disclose their customers’ sensitive information. This makes Demografy private by design and enables businesses to get 100% coverage of any list.
KEY TAKEAWAYS
- AI and machine learning are a frontline in fighting COVID-19
- However, the tech community critically lacks individual case data
- Demografy can potentially be used to append missing data to incomplete records

SOME BACKGROUND
Private companies’ fight against COVID-19 is not limited to mass production of medical equipment, development of test kits and vaccine research. Another frontline in fighting COVID-19 is data, and advances in artificial intelligence are our weapon in this fight.
Today the tech community is increasingly involved in fighting COVID-19. You can find many data projects that apply AI to COVID-19 data. For example, one of the biggest initiatives is the CORD-19 dataset, in which thousands of scientific papers about COVID-19 are processed with AI and NLP to give medics and scientists fast access to mission-critical information. There are also plenty of Kaggle challenges involving aggregated COVID-19 data.
But one of the biggest gaps in available COVID-19 data is the lack of individual case data. Case data would enable the tech community to build fundamentally new solutions to COVID-19. It would make it possible to build predictive machine learning models that find correlations between a person’s medical condition, symptoms, treatment and other data. For example, it could allow AI to find the optimal treatment for specific demographics, predict the spread of the disease in areas populated by specific demographic groups, and much more.
Although plenty of aggregated data (e.g. cases by country or state) is published online, case-by-case data is missing. Case data can take the form of:
case#23, symptoms, condition, location, demographics, treatment, medication, test results, x-rays, outcome.
PROBLEM OF CLOSED DATA
However, unlike CORD-19, case data is not publicly available to the tech community. Even though citizen science and the public involvement of tech companies, independent researchers and other parties have proven effective, case data remains unavailable and therefore unused. Perhaps only large tech companies have access to patient data through their cooperation with hospitals and authorities.
But the larger share of the tech community does not have the privilege of access to such data, and sometimes this larger share of smaller entities contributes the most to a solution. One possible reason for not publishing case data is the privacy concern about disclosing data that contains, at least in part, individual medical records. However, we can address this problem by stripping all personally identifiable information from the data.
Data like the record below does not in fact pose a risk of identity theft or privacy violation:
case#23, symptoms, condition, location, demographics, treatment, medication, test results, x-rays, outcome.
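To make the idea of stripping personally identifiable information concrete, here is a minimal Python sketch. The field names (`full_name`, `address`, `phone`, `national_id`) and the sample values are hypothetical, chosen only for illustration; a real medical record system would have its own schema and its own list of identifying fields.

```python
# Hypothetical names of fields that directly identify a person.
PII_FIELDS = {"full_name", "address", "phone", "national_id"}


def strip_pii(record: dict, case_number: int) -> dict:
    """Return a copy of `record` without direct identifiers.

    The person is referred to only by an opaque case number,
    in the same "case#23" style used in the example above.
    """
    cleaned = {key: value for key, value in record.items()
               if key not in PII_FIELDS}
    cleaned["case_id"] = f"case#{case_number}"
    return cleaned


# Illustrative raw record (all values are made up).
raw_record = {
    "full_name": "John Johnson",
    "phone": "555-0100",
    "symptoms": "fever, cough",
    "condition": "moderate",
    "treatment": "supportive care",
    "outcome": "recovered",
}

public_record = strip_pii(raw_record, 23)
```

The original record is left untouched; only the cleaned copy, keyed by its case number, would be published.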
This data is especially vital for building real-world AI tools to combat COVID-19. It would enable researchers and tech companies to build models that predict the best treatments for specific demographic groups. It would provide vital demographic data to scientists developing vaccines, and actionable analytics and insights to decision makers. It would allow the development of fast decision-making tools that help medics reach an early, definitive diagnosis and prepare the relevant treatment for each particular case. There are plenty of other applications that the AI community could come up with if it had case data at hand.
Case data for machine learning has already proven effective in healthcare. One of its many possible applications is the automated prioritization of x-ray results for expedited review by medics.

PROBLEM OF INCOMPLETE DATA
Even if case data is published, there is another problem: its incompleteness. Demographic data is the most important information in patient case data after medical information such as condition, treatment, medication and test results. It is impossible to build demographic-centric models without demographic data for the cases at hand.
However, demographics are not always present in medical records. In most cases they are limited to sex and age, and racial data is not always available. For example, according to Johns Hopkins University & Medicine:
- 9 states don’t provide race data for confirmed COVID-19 cases
- 12 states don’t provide race data for COVID-19 related deaths
- 48 states don’t provide race data for COVID-19 testing
How can we append missing demographic data to case data? Traditional solutions include data append services and data brokers. They usually require personally identifiable information, since they match records against consumer databases in order to provide additional information. For obvious reasons, this is not an option for medical case data, because it jeopardizes privacy. Besides, data append services provide low coverage and unpredictable accuracy.
A possible alternative is to use privacy-by-design solutions like Demografy. Demografy’s key difference from traditional data append is that it relies on machine learning instead of consumer databases. It uses machine learning to predict demographics from input that is not personally identifiable. It needs only first names and masked last names, which hide identities, e.g. John J*son. Thus the identities in a list remain safe. Privacy-focused solutions are a must if case data is to be shared with third parties at all.
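As an illustration of name masking, the following Python sketch reproduces the "John J*son" example. The exact masking rule is an assumption inferred from that single example (keep the first letter and the last three letters of the last name, and replace everything in between with one asterisk); it is not a documented Demografy specification, and the function names are hypothetical.

```python
def mask_last_name(last_name: str, keep_tail: int = 3) -> str:
    """Mask a last name, e.g. "Johnson" -> "J*son".

    Assumed rule: first letter + "*" + last `keep_tail` letters.
    Names too short to hide anything that way keep only their first letter.
    """
    name = last_name.strip()
    if not name:
        return ""
    if len(name) <= keep_tail + 1:
        return name[0] + "*"
    return name[0] + "*" + name[-keep_tail:]


def mask_full_name(first_name: str, last_name: str) -> str:
    """Combine an unmasked first name with a masked last name."""
    return f"{first_name.strip()} {mask_last_name(last_name)}"


masked = mask_full_name("John", "Johnson")  # "John J*son"
```

Many different last names map to the same masked form, which is what keeps an individual from being identified by the masked name alone.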
For COVID-19 case data, we propose the following way to use Demografy as a privacy-safe proxy between data owners and the general public:
1. Healthcare organizations, authorities and other parties that possess COVID-19 case data share only first names and masked last names with Demografy (without full names and without medical information such as symptoms and treatment)
2. Demografy predicts the missing demographic data for the provided masked names
3. The tech community then has access to richer data that contains both demographics (gender, age, race, ethnicity, etc.) and medical information (symptoms, treatment, condition, etc.) but no full names or other personally identifiable information.
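The three steps above can be sketched in Python as follows. This is only an outline under stated assumptions: `predict_demographics` is a hypothetical callable standing in for whatever prediction service is used (it is not Demografy's actual API), the masking rule is inferred from the "John J*son" example, and the record fields are illustrative.

```python
from typing import Callable, Dict, List

NAME_FIELDS = ("first_name", "last_name")


def mask_last_name(last_name: str, keep_tail: int = 3) -> str:
    """Assumed masking rule: first letter, an asterisk, last few letters."""
    name = last_name.strip()
    if len(name) <= keep_tail + 1:
        return name[:1] + "*"
    return name[0] + "*" + name[-keep_tail:]


def build_public_dataset(
    cases: List[Dict],
    predict_demographics: Callable[[Dict], Dict],
) -> List[Dict]:
    """Enrich case records with predicted demographics and drop all names."""
    public = []
    for case in cases:
        # Step 1: only the first name and masked last name leave the data owner.
        masked_name = {
            "first_name": case["first_name"],
            "last_name": mask_last_name(case["last_name"]),
        }
        # Step 2: the prediction service sees names only, no medical data.
        demographics = predict_demographics(masked_name)
        # Step 3: publish medical fields plus demographics, without names.
        record = {k: v for k, v in case.items() if k not in NAME_FIELDS}
        record["demographics"] = demographics
        public.append(record)
    return public


def placeholder_predictor(masked_name: Dict) -> Dict:
    """Stand-in for a real model; it returns empty predictions."""
    return {"gender": None, "age_group": None, "ethnicity": None}


cases = [{
    "case_id": "case#23",
    "first_name": "John",
    "last_name": "Johnson",
    "symptoms": "fever, cough",
    "outcome": "recovered",
}]
public_cases = build_public_dataset(cases, placeholder_predictor)
```

Passing the predictor in as a parameter keeps the data owner's side independent of any particular service: the medical fields never reach the predictor, and the names never reach the published dataset.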
However, there is little we can do without action from authorities and/or healthcare organizations. They have the case data, but they are still reluctant to share it with the tech community.
Meanwhile, Demografy offers its services for free to any non-profit COVID-19 initiative: https://demografy.com/covid19. Contact us if you want to find out more.



AI, COVID-19 and data was originally published in Becoming Human: Artificial Intelligence Magazine on Medium, where people are continuing the conversation by highlighting and responding to this story.
Via https://becominghuman.ai/ai-covid-19-and-data-fdc2d0100a43?source=rss—-5e5bef33608a—4
source https://365datascience.weebly.com/the-best-data-science-blog-2020/ai-covid-19-and-data
