AI-Assisted Genome Studies Are Riddled with Errors

Researchers used artificial intelligence in large genomics studies to fill in gaps in patient information and improve predictions, but new research uncovers false positives and misleading correlations.

Written by Sahana Sitaraman, PhD

The genome serves as the blueprint for the body, influencing every trait from the shape of the face to the arches of the feet, and even the development of certain diseases. While some disorders, like cystic fibrosis, are linked to single genes and can be reliably predicted based on a person’s genetic data, many others, such as autism spectrum disorder, Alzheimer’s disease, depression, and obesity, are not.

For the past 15 years, scientists have used genome-wide association studies (GWAS) to compare the genomes of large groups of people and identify hundreds of thousands of genetic variants that are associated with a trait or disease.1 This method has helped scientists unravel the underlying biology and risk factors of complex diseases and has also led to the discovery of novel drug targets. Despite these advancements, GWAS have their limitations, which scientists have tried to address with the help of artificial intelligence (AI). However, in two studies published in Nature Genetics, researchers at the University of Wisconsin-Madison identified pervasive biases that these new approaches can introduce when working with large but incomplete datasets.2,3

GWAS rely on large biobanks with extensive patient data. However, these repositories can lack anything from blood reports, scans, and patient history to family data. Even with a thorough survey, challenges such as the lack of data on late-onset diseases in a cohort of young participants can throw a wrench into researchers’ plans.

To address gaps in the data, scientists developed two approaches: machine learning and GWAS-by-proxy (GWAX), which relies on family history data as predictors of late-onset diseases. Many researchers combine GWAS and GWAX to improve the statistical power of their predictions. However, the University of Wisconsin-Madison research team has found that these “solutions” can erroneously link gene variants with diseases.

“It has become very popular in recent years to leverage advances in machine learning, so we now have these advanced machine-learning AI models that researchers use to predict complex traits and disease risks with even limited data,” said Qiongshi Lu, a biostatistician at the University of Wisconsin-Madison and coauthor of the studies, in a press release.

With AI-assisted GWAS, Lu and his colleagues noticed false associations between gene variants and type 2 diabetes. For example, four gene variants showed a high correlation with the disease in an AI-assisted GWAS, but not when using a conventional GWAS approach. Previous research has shown that although these genes act on a cellular pathway that is indirectly connected to blood glucose levels, the pathway does not strongly influence those levels.

In cohorts where all samples have genetic data but only a fraction of samples have desired phenotypic data, AI-assisted GWAS algorithms try to fill in the gaps based on learned patterns. But without knowledge of physiological intricacies, this approach can lead researchers down the wrong path.

“The problem is if you trust the machine learning-predicted diabetes risk as the actual risk, you would think all those genetic variations are correlated with actual diabetes even though they aren’t,” Lu said.
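
A toy simulation can illustrate the failure mode Lu describes. The sketch below is not the team's method or data: the function names, effect sizes, and the single "biomarker" predictor are all assumptions chosen for illustration. In it, a variant influences a biomarker but not the trait itself, a simple linear model learns to predict the trait from that biomarker in the labeled subset, and an association test is then run on both the measured and the predicted trait.

```python
import numpy as np


def simulate(n=20000, labeled_frac=0.2, seed=0):
    """Simulate a cohort where a variant affects a biomarker but not the trait.

    All quantities are hypothetical and exist only to illustrate the idea.
    """
    rng = np.random.default_rng(seed)
    g = rng.binomial(2, 0.3, size=n).astype(float)  # genotype dosage: 0, 1, or 2
    z = rng.normal(size=n)                          # non-genetic factor
    y = z + rng.normal(size=n)                      # true trait: independent of g
    x = g + z + rng.normal(size=n)                  # biomarker: driven by g and z

    # Only a fraction of participants have the trait measured.
    labeled = rng.random(n) < labeled_frac

    # "Machine learning" step, reduced to a one-variable linear fit
    # trained on the labeled participants only.
    slope, intercept = np.polyfit(x[labeled], y[labeled], 1)
    y_hat = slope * x + intercept                   # predicted trait for everyone
    return g, y, y_hat, labeled


def assoc_t(g, pheno):
    """t-statistic for the slope in a simple regression of pheno on g."""
    r = np.corrcoef(g, pheno)[0, 1]
    n = len(g)
    return r * np.sqrt((n - 2) / (1 - r ** 2))


if __name__ == "__main__":
    g, y, y_hat, labeled = simulate()
    # Conventional test: measured trait, labeled participants only.
    print("measured trait t:", assoc_t(g[labeled], y[labeled]))
    # AI-assisted test: predicted trait treated as if it were the real one.
    print("predicted trait t:", assoc_t(g, y_hat))
```

Under these assumptions, the statistic for the measured trait should stay close to zero, while the statistic for the predicted trait is strongly positive, because the prediction inherits the variant's effect on the biomarker rather than any effect on the trait.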

Compensating for the holes in the data banks with proxies is also problematic. For example, when analyzing the correlation of multiple traits with the risk of developing Alzheimer’s disease, Lu observed a divergence from GWAS results, which are based on actual data. A key discrepancy was the association between educational attainment and the risk of Alzheimer’s disease. Multiple groups have reported an inverse correlation between these variables, a result that is backed by GWAS. However, Lu observed a positive correlation when GWAX approaches were used. The proxy-information approach also failed to show a link between the disease and lower cognition later in life, contrary to previous data and GWAS findings.

The team proposed new statistical methods that researchers can use to correct these biases and increase the reliability of their findings. They urge the research community to transparently report findings and to adopt a more rigorous and cautious outlook when drawing conclusions from these methods.

“Our group’s recent studies provide humbling examples and highlight the importance of statistical rigor in biobank-scale research studies,” Lu said.

Meet the Author

Sahana is an Assistant Editor at The Scientist, where she crafts stories that bring the wonders and oddities of science to life. In 2022, she earned a PhD in neuroscience from the National Centre for Biological Sciences, India, studying how neurons develop their stereotypical tree-like shapes. In a parallel universe, Sahana is a passionate singer and an enthusiastic hiker.
