Synthetic data: How could it be used for infectious disease research?

Authors: Styliani-Christina Fragkouli, Dhwani Solanki, Leyla J Castro, Fotis E Psomopoulos, Núria Queralt-Rosinach, Davide Cirillo, Lisa C Crossman

Published: 2024-07-03 17:13:04+00:00

AI Summary

This commentary provides an overview of the current and future applications of synthetic data in infectious disease research. It explores the benefits of synthetic data, such as improved data privacy and reduced bias in machine learning models, and discusses various methods for generating synthetic data.

Abstract

Over the last three to five years, it has become possible to generate synthetic data with machine learning for healthcare-related uses. However, concerns have been raised about the potential negative consequences of artificial dataset generation. These include the potential misuse of generative artificial intelligence (GenAI) in fields such as cybercrime, the use of deepfakes and fake news to deceive or manipulate, and the displacement of human jobs across various market sectors. Here, we consider both current and future positive advances and possibilities with synthetic datasets. Synthetic data offers significant benefits, particularly for data privacy, for research, and for balancing datasets and reducing bias in machine learning models. GenAI is a branch of artificial intelligence capable of creating text, images, video or other data using generative models. The recent explosion of interest in GenAI was heralded by the invention and rapid adoption of large language models (LLMs). These computational models can perform general-purpose language generation and other natural language processing tasks and are based on transformer architectures, which represented an evolutionary leap from previous neural network architectures. Fuelled by the advent of improved GenAI techniques and wide-scale usage, this is surely the time to consider how synthetic data can be used to advance infectious disease research. In this commentary we aim to provide an overview of the current and future position of synthetic data in infectious disease research.


Key findings

The review highlights the potential of synthetic data to address challenges in infectious disease research, such as data privacy and bias in machine learning models. It surveys applications ranging from diagnostics to pandemic modeling and emphasizes the need for continued development and validation of synthetic data generation methods.

Approach

The paper reviews existing literature on the use of synthetic data in infectious disease research. It examines different methods for generating synthetic data, including GANs and VAEs, and discusses applications of synthetic data in areas such as COVID-19 diagnostics, wastewater surveillance, and pandemic modeling.

Datasets

Various datasets are mentioned, including COVID-19 X-ray image datasets, wastewater genomic data, and real-world COVID-19 patient data from sources such as the N3C and the UK CPRD Aurum database. The paper does not give specific details of the datasets used in the cited studies; it only discusses how they were used to develop synthetic data.

Model(s)

Generative Adversarial Networks (GANs), Variational Autoencoders (VAEs), and Large Language Models (LLMs) are mentioned as methods for generating synthetic data. Specific model architectures are not detailed.

Author countries

Greece, Germany, The Netherlands, Spain, UK