The UniPD research team creates synthetic genomes for imaginary humans


Fragments of artificial genomes with real-world characteristics, generated from an existing genomic database, are the result of a study published in "PLOS Genetics." The study was conducted by an international research team from the University of Tartu (Estonia) and Université Paris-Saclay (France), together with Prof. Luca Pagani of the Department of Biology at the University of Padua, where he is a professor of Computational Anthropology.

The goal is not to build a superhuman in a laboratory, or rather in a computer, but to advance biomedical research by providing a genomic data platform of a kind that is currently unavailable or inaccessible. In recent years, thanks to new algorithms, artificial intelligence has learned to reproduce complex patterns found in real-world data and to generate high-quality "synthetic data", including realistic images of fake humans (e.g. https://thisxdoesnotexist.com/). If the same artificial intelligence techniques could be applied to biology, these artificial beings could also be given a genetic heritage. "Existing genomic databases are an invaluable resource for biomedical research, but they are either not publicly accessible or shielded behind long and exhausting application procedures due to valid ethical concerns. This creates a major scientific barrier for researchers. Machine-generated genomes, or artificial genomes as we call them, can help us overcome the issue within a safe ethical framework," said Burak Yelmen, first author of the study and Junior Research Fellow of Modern Population Genetics at the University of Tartu.

The team used two main approaches to generate artificial genomes, both trained on real data from a genomic database. First, they trained generative adversarial networks (GANs), a type of model that 'learns' to generate new data with the same statistics as its training set. A well-known use of GANs is the 'construction' of new photographs that appear superficially authentic to human observers because they incorporate realistic features of the photo database on which the model was trained. Second, they used Restricted Boltzmann Machines (RBMs), models that learn a probability distribution over their input data and can then provide a representation of that distribution. The multidisciplinary team performed several analyses to compare the characteristics of the artificial genomes with those of real genomes.
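
As a rough illustration of what comparing artificial and real genomes can involve, the sketch below checks whether allele frequencies in a generated panel track those in the panel used for training. Everything in it is an assumption made for illustration: the toy data, the function names, and above all the generator, which is a simple stand-in that samples each site independently and is not a GAN or an RBM. The study itself compared many more properties than this one.

```python
import random


def allele_frequencies(haplotypes):
    """Fraction of haplotypes carrying allele 1 at each site."""
    n = len(haplotypes)
    return [sum(site) / n for site in zip(*haplotypes)]


def pearson(x, y):
    """Pearson correlation between two equal-length lists of numbers."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


rng = random.Random(0)
n_sites = 50

# Toy "real" panel (hypothetical data): 500 haplotypes of 0/1 alleles,
# each site drawn with its own underlying frequency.
true_freqs = [rng.uniform(0.05, 0.95) for _ in range(n_sites)]
real = [[int(rng.random() < p) for p in true_freqs] for _ in range(500)]

# Stand-in generator (NOT a GAN or RBM): sample every site independently
# using the frequencies observed in the "real" panel.
real_freqs = allele_frequencies(real)
artificial = [[int(rng.random() < p) for p in real_freqs] for _ in range(500)]

# One simple comparison: do per-site allele frequencies agree?
r = pearson(real_freqs, allele_frequencies(artificial))
print(f"allele-frequency correlation: {r:.3f}")
```

A generator this simple matches single-site frequencies but ignores correlations between neighbouring sites, which is exactly the kind of structure a trained generative model is expected to capture and which a fuller comparison would also test.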

"As surprising as it may seem," continues Luca Pagani, a senior author and University of Padua professor, "the genomes emerging from random noise mimic the complexities that we are able to observe within real human populations and, for most properties, they are indistinguishable from other genomes taken from the biobank and used to train our algorithm, except for one detail: they do not belong to a gene donor."

Another important issue that the study set out to examine was the protection of the personal and sensitive data contained in the original database, some of which could be carried over into the artificial genomes. Does the similarity of the artificial genomes to the real ones compromise the privacy of the individuals to whom the real genomes belonged? The question is complex, in part because it has not yet been formalized in detail. It is comparable to asking whether one person's voluntary publication of his or her genome damages the privacy of another.

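One simple way to make the privacy question concrete is to look for artificial genomes that are identical, or nearly identical, to a genome in the training set. The sketch below is a minimal illustration under stated assumptions: the function names, the toy data, and the `min_distance` threshold are all hypothetical, and this is a single basic check rather than the combination of statistical measures used in the study.

```python
def hamming(a, b):
    """Number of sites at which two equal-length genomes differ."""
    return sum(x != y for x, y in zip(a, b))


def nearest_training_distance(candidate, training):
    """Smallest Hamming distance from a candidate to any training genome."""
    return min(hamming(candidate, t) for t in training)


def flag_possible_leaks(artificial, training, min_distance):
    """Indices of artificial genomes that lie closer than min_distance
    to at least one training genome."""
    return [
        i
        for i, genome in enumerate(artificial)
        if nearest_training_distance(genome, training) < min_distance
    ]


# Toy data (hypothetical): each genome is a short list of 0/1 alleles.
training = [
    [0, 1, 0, 1, 1, 0, 0, 1],
    [1, 1, 0, 0, 1, 0, 1, 0],
    [0, 0, 1, 1, 0, 1, 0, 1],
]
artificial = [
    [0, 1, 0, 1, 1, 0, 0, 1],  # exact copy of training[0]
    [1, 0, 1, 0, 0, 1, 1, 0],  # at least 4 sites away from every training genome
    [1, 1, 0, 0, 1, 0, 1, 1],  # one site away from training[1]
]

print(flag_possible_leaks(artificial, training, min_distance=2))  # [0, 2]
```

In practice the threshold would need to be calibrated, for example against the distances observed between real genomes that were not used for training, since unrelated people also share long stretches of identical sequence.
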
"Although detecting privacy leaks among thousands of genomes could appear as looking for a needle in a haystack, combining multiple statistical measures allowed us to check all models carefully. Excitingly, the detailed exploration of complex leakage patterns can lead to improvements in generative model evaluation and design, and will fuel back the machine learning field," said Dr. Flora Jay, the coordinator of the study and CNRS researcher in the Interdisciplinary computer science laboratory (LRI/LISN, Université Paris-Saclay, French National Centre for Scientific Research).