Advancements in genetic research have led to the development of new tools for predicting an individual's risk for complex diseases. One such tool is polygenic risk score (PRS) analysis, which uses information from multiple genetic variants to generate a personalized risk score for an individual. In this post, we will explore the basics of PRS analysis, its potential applications in healthcare, and its limitations.
A polygenic risk score is a numerical estimate of an individual's genetic risk for a particular trait or disease, based on many genetic variants spread across the genome. These variants are typically common in the population, and each has only a small effect on the risk for the trait or disease. By combining information from many variants, PRS analysis can add to the risk estimate obtained from traditional risk factors such as age, sex, and lifestyle.
Knowing whether your genetic background increases your risk of developing certain diseases may help you make important decisions about your health.
We all have near-identical DNA sequences. What makes us unique are slight differences in our DNA that are called genetic variants.
Our DNA’s code is made of 4 chemical building blocks – A (adenine), T (thymine), C (cytosine) and G (guanine). A genetic variant occurs at a location within the DNA where that code differs among people.
For example, if Person #1 has an “A” at the same location in the DNA code where Person #2 has a “T”, that’s a genetic variant.
There are roughly 4 to 5 million genetic variants in an individual’s genome. Not all of them are unique to that person; some occur in other people as well. Some variants increase the risk of developing diseases, while others may reduce such risk. Others have no effect on disease risk at all. Let’s take a closer look at possible connections between genetic variants and diseases.
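As a minimal illustration, assuming two short, already-aligned DNA sequences (both sequences below are made up), the locations where two people differ could be listed like this:

```python
def find_variant_positions(seq_a: str, seq_b: str) -> list[tuple[int, str, str]]:
    """Return (position, base_a, base_b) for every site where two aligned sequences differ.

    Positions are 1-based. The sequences are assumed to be aligned and of equal length.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned and of equal length")
    return [
        (pos, a, b)
        for pos, (a, b) in enumerate(zip(seq_a, seq_b), start=1)
        if a != b
    ]


# Hypothetical sequences: the two people differ only at position 5 (A vs T).
person_1 = "ATGCATTGCA"
person_2 = "ATGCTTTGCA"
variants = find_variant_positions(person_1, person_2)
print(variants)
```

Real variant calling works on sequencing reads compared against a reference genome and is far more involved; this sketch only shows what "differs at a location" means.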
Genetic variants can also impact our risk of developing certain diseases — these are called risk variants.
To discover risk variants, scientists compare the genetic codes of people without a disease to people with a disease.
If a genetic variant occurs more frequently in people with a disease, it is associated with increased risk.
If a genetic variant occurs more frequently in people without a disease, it is associated with decreased risk.
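One common way to put a number on this comparison is an odds ratio. The sketch below assumes simple allele counts in cases and controls; the counts and the function name are hypothetical, and real GWAS typically use regression models that adjust for covariates such as ancestry.

```python
def allele_odds_ratio(case_alt: int, case_ref: int, control_alt: int, control_ref: int) -> float:
    """Odds of carrying the variant allele in cases divided by the odds in controls."""
    return (case_alt / case_ref) / (control_alt / control_ref)


# Hypothetical counts: the variant allele is more frequent in people with the disease.
risk_or = allele_odds_ratio(case_alt=300, case_ref=700, control_alt=200, control_ref=800)

# Hypothetical counts: the variant allele is more frequent in people without the disease.
protective_or = allele_odds_ratio(case_alt=150, case_ref=850, control_alt=200, control_ref=800)

print(f"risk variant OR = {risk_or:.2f}")
print(f"protective variant OR = {protective_or:.2f}")
```

An odds ratio above 1 corresponds to a variant associated with increased risk, and an odds ratio below 1 to a variant associated with decreased risk.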
A PRS analysis can be split into three main stages – QC of base and target data, calculation of PRSs, and interpretation and presentation of results – each with recommendations for best practice (summarized in the Figure below).
PRS analyses require two main input data sets:
(1) Base data (GWAS) – contains summary statistics of single-nucleotide variants; and
(2) Target data – consists of genotypes and usually phenotype(s) in individuals from a sample independent of the GWAS sample.
The quality of these datasets determines the power and validity of PRS analyses, so both must undergo several quality control (QC) steps. The researchers' recommendations for QC are outlined below:
● Heritability check – only perform PRS analyses on GWAS data with a SNP heritability estimate (h^2_SNP) > 0.05.
● Effect allele – the identity of the effect allele must be obtained from GWAS investigators.
● Sample size – to minimize misleading results, only perform PRS analyses that involve association testing on target samples of ≥100 individuals.
● File transfer – ensure files have not been corrupted during transfer.
● Genome build – ensure that SNPs from both datasets have genomic positions assigned to the same build.
● Standard GWAS QC – follow established guidelines to perform standard GWAS QC.
● Ambiguous SNPs – remove all ambiguous SNPs to avoid introducing systematic errors.
● Mismatching SNPs – strand-flip the alleles; most PRS software performs strand-flipping automatically for SNPs that are resolvable and removes those that are not.
● Duplicate SNPs – ensure there are no duplicates, to avoid errors and software crashes.
● Sex chromosomes – remove SNPs on the sex chromosomes if the analysis looks at autosomal genetics only.
● Sample overlap – remove overlapping samples to avoid inflation; researchers recommend judicious use of target samples.
● Relatedness – remove any closely related individuals to avoid inflation.
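Two of the checks above, ambiguous SNPs and mismatching SNPs, can be sketched in a few lines. This is a minimal illustration, not how any particular PRS tool implements it: the function names and return labels are hypothetical, and alleles are assumed to be single bases.

```python
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def is_ambiguous(allele_1: str, allele_2: str) -> bool:
    """A/T and C/G SNPs read the same on both strands, so the strand cannot be inferred."""
    return COMPLEMENT[allele_1] == allele_2


def harmonize(base_alleles: tuple[str, str], target_alleles: tuple[str, str]) -> str:
    """Classify how a target SNP's alleles relate to the base (GWAS) alleles.

    Returns 'match', 'swap' (alleles in the opposite order), 'flip' (opposite strand),
    'flip_swap' (opposite strand and opposite order), or 'drop' (ambiguous or unresolvable).
    """
    b1, b2 = base_alleles
    t1, t2 = target_alleles
    if is_ambiguous(b1, b2):
        return "drop"
    if (t1, t2) == (b1, b2):
        return "match"
    if (t1, t2) == (b2, b1):
        return "swap"
    flipped = (COMPLEMENT[t1], COMPLEMENT[t2])
    if flipped == (b1, b2):
        return "flip"
    if flipped == (b2, b1):
        return "flip_swap"
    return "drop"


print(harmonize(("A", "G"), ("T", "C")))  # target is reported on the opposite strand
print(harmonize(("A", "T"), ("A", "T")))  # ambiguous SNP
```

SNPs labelled 'swap' or 'flip_swap' would also need the sign of their effect size reversed before scoring, since the effect allele is the other allele in the target data.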
PRS analysis is a statistical approach that integrates information from thousands of genetic variants to generate a personalized risk score for an individual. The process of generating a PRS involves several steps:
● Identification of genetic variants: The first step in generating a PRS is to identify a set of genetic variants that are associated with the trait or disease of interest. This is typically done through genome-wide association studies (GWAS), which involve comparing the genomes of large numbers of people with and without the disease to identify genetic variants that are more common in the affected individuals.
● Estimation of effect sizes: The next step is to estimate the effect size of each genetic variant on the risk for the trait or disease. This is typically done by calculating the odds ratio or beta coefficient of each variant in the GWAS dataset.
● Weighting by effect size: For each individual, the number of copies of the effect allele carried at each variant (0, 1, or 2) is weighted by that variant's estimated effect size, so variants with larger effects contribute more to the score.
● Calculation of the PRS: The weighted allele counts are then summed to generate a polygenic risk score for the individual. The PRS is typically standardized to have a mean of zero and a standard deviation of one within a reference sample, so that an individual's score can be read relative to that sample.
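The scoring step can be sketched as follows. This assumes the common convention that each variant's effect-allele count (0, 1, or 2) is multiplied by its GWAS effect size and the products are summed; the SNP names, effect sizes, and genotypes are all made up, and practical pipelines also select or shrink variants before scoring.

```python
from statistics import mean, pstdev

# Hypothetical per-variant effect sizes (beta coefficients) from a GWAS.
effect_sizes = {"rs_a": 0.12, "rs_b": -0.05, "rs_c": 0.30}

# Hypothetical effect-allele counts (0, 1, or 2) for three individuals.
genotypes = {
    "ind1": {"rs_a": 0, "rs_b": 1, "rs_c": 2},
    "ind2": {"rs_a": 1, "rs_b": 1, "rs_c": 0},
    "ind3": {"rs_a": 2, "rs_b": 0, "rs_c": 1},
}


def raw_prs(dosages: dict[str, int], betas: dict[str, float]) -> float:
    """Sum of effect size times effect-allele count over all scored variants."""
    return sum(betas[snp] * dosages[snp] for snp in betas)


raw = {ind: raw_prs(dosages, effect_sizes) for ind, dosages in genotypes.items()}

# Standardize to mean 0 and standard deviation 1 within this (tiny) sample.
mu = mean(raw.values())
sigma = pstdev(raw.values())
standardized = {ind: (score - mu) / sigma for ind, score in raw.items()}

for ind in raw:
    print(f"{ind}: raw = {raw[ind]:.2f}, standardized = {standardized[ind]:.2f}")
```

In practice the mean and standard deviation would come from a large reference sample rather than from three individuals.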
Most people have average genetic risk of disease.
Your score can be higher than average, meaning that you have increased genetic risk of disease compared to most people.
If your polygenic score is in the 95th percentile, you do not have a 95% chance of developing the disease. Rather it means that — out of 100 people — your polygenic score is higher than 95 people and the same or lower than 5.
Your score can be lower than average, meaning that you have decreased genetic risk of disease compared to most people.
If your polygenic score is in the 5th percentile, you do not have a 5% chance of developing the disease. Rather it means that — out of 100 people — your polygenic score is higher than 5 people and the same or lower than 95.
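The percentile reading above can be made concrete with a small sketch. The reference scores here are hypothetical, and the function name is only for illustration; the point is that a percentile ranks a score against other people's scores and says nothing directly about the probability of disease.

```python
def percentile_rank(score: float, reference_scores: list[float]) -> float:
    """Percentage of reference scores that are strictly lower than `score`."""
    below = sum(1 for s in reference_scores if s < score)
    return 100.0 * below / len(reference_scores)


# 100 hypothetical reference scores, evenly spaced from -5.0 to 4.9.
reference = [i / 10 for i in range(-50, 50)]

rank = percentile_rank(4.5, reference)
print(f"score 4.5 is higher than {rank:.0f}% of the reference scores")
```

Here the score is higher than 95 of the 100 reference scores and the same as or lower than the remaining 5, which is what "95th percentile" means in the text above.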
The majority of genomic studies to date have examined individuals of European ancestry. As a result, there may not be adequate data about genomic variants in other populations for calculating a polygenic risk score in those populations. This historic lack of diversity also affects other areas of genomics research and contributes to a wider concern that genomic medicine could increase health disparities.
At this point in time, polygenic risk scores may be accurate and useful mainly for populations of European ancestry. More research is needed to generate the data that would make polygenic risk scores useful for other populations.
Polygenic scores may be useful tools to assess risk for important diseases, such as coronary artery disease — a leading cause of death in the U.S. and globally.
Coronary artery disease occurs due to buildup of plaque in the blood vessels that supply oxygen-rich blood to the heart muscle.
Plaque can start to build up as early as our 20s. It accumulates over time and can ultimately block a vessel completely, causing a myocardial infarction, or 'heart attack.' In the U.S., about 5% (1 in 20) of individuals develop coronary artery disease (CAD) by age 50, and up to 25% (1 in 4) develop it by age 80.
However, there are some limitations to polygenic risk scores, such as potential inaccuracies and the need for further research to fully understand their clinical utility. Despite these limitations, polygenic risk scores hold promise as a tool for improving personalized healthcare and disease prevention.
At Predera, we believe that machine learning and AI-based solutions hold great promise for the development of polygenic risk scores. These scores are based on the analysis of large amounts of genomic data, and machine learning algorithms can be used to identify the genetic variants that contribute to disease risk. We are committed to advancing the field of polygenic risk scores using the latest machine learning and AI-based tools, with the ultimate goal of improving human health and quality of life.
Stay tuned for Part 3 of our blog, where we will walk through a step-by-step tutorial on performing basic polygenic risk score (PRS) analyses.