Genomics and Genome-Wide Association Studies (GWAS) have revolutionized our understanding of human biology and disease. By studying the entire genome of individuals, researchers can identify genetic variations that are associated with different diseases, traits, and behaviors. This approach has led to groundbreaking discoveries in fields ranging from medicine and agriculture to forensics and evolution.
Genome-wide association studies (GWAS) are a powerful approach for identifying genetic variations associated with a particular disease phenotype. A GWAS scans genetic markers such as single-nucleotide polymorphisms (SNPs) in the DNA of many people in order to find variations associated with the disease phenotype. Once new genetic associations are identified, researchers can use that information to develop better strategies for detecting, treating, and preventing diseases. Genomic data is becoming cheaper to generate thanks to advances in sequencing technology, while the cost of generating high-quality phenotypic data is increasing.
To conduct a GWAS, researchers first define the disease phenotype and divide participants into two groups: cases (people with the disease phenotype) and controls (people without the disease phenotype). DNA samples are then obtained from all participants, and lab machines are used to quickly survey each participant's genome for genetic variants, most commonly single-nucleotide polymorphisms (SNPs). For each SNP, the variant frequencies are calculated for both the cases and the controls, an odds ratio is computed, and a statistical test (for example, a chi-squared test) yields a p-value. If the p-value is small enough, the variation is deemed to be significant, and the associated genetic variations can serve as powerful pointers to the region of the human genome that may cause the disease.
Here's an example showing in more detail how a GWAS association is computed.
We first identify the cases and controls, that is, the people with the disease phenotype and the people without it. In this example, we have 4000 people with the disease phenotype and 6000 people without it. Then we iterate over all the SNPs and compare the variant frequencies in the two groups. For instance, for SNP1, 2676 of the 6000 controls carry the variant G at this location, so the frequency of G in the control group is 44.6%.
In the case group, 2104 of the 4000 people carry the variant G at this location, so the frequency is 52.6%. If we go through the calculation, we find that the p-value is about 5 × 10^-15, which means the association is extremely significant. We can run the same calculation on SNP2 and find a p-value of 0.33, which is not significant. To support a GWAS, we need high-quality phenotypes for the cases and controls in order to perform this calculation, which is why the phenotyping algorithm is so important.
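The SNP1 calculation can be sketched in a few lines of Python. This is a minimal illustration, not the exact procedure behind the numbers quoted above: it assumes a chi-squared test with one degree of freedom on the 2x2 table of carriers versus non-carriers, with an optional Yates continuity correction, and the function name `snp_association` is made up for this example.

```python
import math

def snp_association(case_alt, case_total, ctrl_alt, ctrl_total, yates=True):
    """Return (odds_ratio, chi2, p_value) for one SNP in a case/control study.

    Assumption: a 1-degree-of-freedom chi-squared test on the 2x2 table
    of carriers vs. non-carriers; real GWAS pipelines may use other tests.
    """
    a, b = case_alt, case_total - case_alt   # cases: with / without variant
    c, d = ctrl_alt, ctrl_total - ctrl_alt   # controls: with / without variant
    n = a + b + c + d

    # Odds of carrying the variant in cases divided by the odds in controls.
    odds_ratio = (a * d) / (b * c)

    # Shortcut formula for the chi-squared statistic of a 2x2 table.
    diff = abs(a * d - b * c)
    if yates:
        diff = max(0.0, diff - n / 2)        # Yates continuity correction
    chi2 = n * diff ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))

    # Upper tail of a chi-squared distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return odds_ratio, chi2, p_value

# SNP1 counts from the example: 2104 of 4000 cases, 2676 of 6000 controls.
odds_ratio, chi2, p_value = snp_association(2104, 4000, 2676, 6000)
print(f"odds ratio = {odds_ratio:.3f}, chi2 = {chi2:.2f}, p = {p_value:.1e}")
```

With the SNP1 counts, this sketch gives an odds ratio above 1 (the variant is more common in cases) and a p-value on the order of 10^-15, in line with the figure quoted in the example.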
Types of data :
Genomics data generally falls into the three categories below.
Sequence :
Genome Sequence data is the genetic information stored in an organism's DNA. It's obtained by reading the order of the four DNA bases (A, T, C, G), which can reveal information about an organism's genes, functions, and evolution. Genome sequencing techniques include whole genome, targeted, and RNA sequencing, and the resulting data can be used to gain insights into various areas of biology, such as human health, agriculture, and environmental conservation.
Annotations :
Descriptions of features (e.g. genes, transcripts, SNPs, start codons) that appear in genomes or transcripts. Annotations typically include coordinates (chromosome name, chromosome positions, and strand) as well as properties (gene name, function, GO terms, etc.) of a given feature. Annotation is an essential step in interpreting genomic sequencing data, and it is crucial for understanding diseases and developing new therapies.
Quantitative Data :
Any kind of numerical value associated with a chromosomal position. Because quantitative data associates values with chromosomal coordinates, it can be considered an annotation of sorts. It is therefore important to make sure that the coordinates in your data file match the genome build used by your feature annotation and/or read alignments.
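To make the point about matching genome builds concrete, here is a small sketch of how an annotation and a set of quantitative values might be represented and combined. Everything in it is illustrative: the `Feature` and `Signal` classes, the `values_in_feature` helper, the gene name, and the coordinates are hypothetical, and positions are assumed to be 1-based and inclusive.

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class Feature:
    """An annotation: coordinates plus free-form properties."""
    build: str            # genome build the coordinates refer to
    chrom: str
    start: int            # assumed 1-based, inclusive
    end: int              # assumed 1-based, inclusive
    strand: str           # "+" or "-"
    properties: dict = field(default_factory=dict)

@dataclass(frozen=True)
class Signal:
    """A quantitative value attached to a single chromosomal position."""
    build: str
    chrom: str
    position: int
    value: float

def values_in_feature(feature, signals):
    """Collect the values that fall inside a feature.

    Refuses to mix coordinates from different genome builds, because the
    same position can point at different bases in different builds.
    """
    hits = []
    for s in signals:
        if s.build != feature.build:
            raise ValueError(
                f"build mismatch: feature is {feature.build}, signal is {s.build}"
            )
        if s.chrom == feature.chrom and feature.start <= s.position <= feature.end:
            hits.append(s.value)
    return hits

# Hypothetical gene and measurements, all on the same build.
gene = Feature("GRCh38", "chr1", 1000, 2000, "+", {"gene_name": "GENE1"})
signals = [
    Signal("GRCh38", "chr1", 1500, 0.8),
    Signal("GRCh38", "chr1", 2500, 0.1),   # outside the gene
    Signal("GRCh38", "chr1", 2000, 1.2),
]
print(values_in_feature(gene, signals))
```

Raising an error on a build mismatch is a deliberate choice in this sketch: silently combining coordinates from two builds tends to produce results that look plausible but are wrong.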
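GWAS data is commonly stored in the following file formats :
1. PLINK format (.ped/.map) : This is a widely used text format for GWAS data and includes both genotype and phenotype information. The ".ped" file contains one line per sample with its identifiers, sex, phenotype, and genotype calls, while the ".map" file contains the genomic coordinates of the genetic markers.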
2. VCF (Variant Call Format) : This format is used to store information about genetic variants and is commonly used in GWAS studies. It includes information on the genotype of each sample at each genetic variant, as well as information on the variant's quality and other attributes.
3. BGEN format : This is a compressed binary format that stores genotype data, including the genotype probabilities produced by imputation. It is commonly used in large-scale GWAS studies due to its efficient storage and processing.
4. PLINK binary format (.bed/.bim/.fam) : This is a binary format that stores genotype information in a compact format, which makes it suitable for large-scale GWAS studies. The ".bed" file contains the genotype data, while the ".bim" file contains the genomic coordinates of the genetic markers and the ".fam" file contains the sample information.
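As a rough illustration of the text PLINK format from item 1, the sketch below parses one ".map" line and one ".ped" line. It assumes the standard whitespace-separated layout (four columns in ".map"; six leading columns in ".ped" followed by two alleles per marker). The sample lines, identifiers, and helper names are invented for the example, and real projects would normally rely on PLINK itself or an established reader library.

```python
from typing import List, NamedTuple, Tuple

class MapRecord(NamedTuple):
    chrom: str
    variant_id: str
    genetic_distance_cm: float
    position_bp: int

class PedRecord(NamedTuple):
    family_id: str
    individual_id: str
    father_id: str
    mother_id: str
    sex: str          # PLINK coding: 1 = male, 2 = female, 0 = unknown
    phenotype: str    # PLINK coding: 1 = control, 2 = case, 0 or -9 = missing
    genotypes: List[Tuple[str, str]]

def parse_map_line(line: str) -> MapRecord:
    """Parse one marker line: chromosome, variant ID, cM, base-pair position."""
    chrom, variant_id, cm, bp = line.split()
    return MapRecord(chrom, variant_id, float(cm), int(bp))

def parse_ped_line(line: str) -> PedRecord:
    """Parse one sample line: six metadata columns, then allele pairs."""
    fields = line.split()
    meta, alleles = fields[:6], fields[6:]
    if len(meta) < 6 or len(alleles) % 2 != 0:
        raise ValueError("expected 6 metadata columns and an even number of alleles")
    # Consecutive alleles are paired up, one pair per marker in the .map file.
    genotypes = list(zip(alleles[0::2], alleles[1::2]))
    return PedRecord(*meta, genotypes)

# Invented example lines: one marker and one sample genotyped at two markers.
marker = parse_map_line("1 rs0001 0 12345")
sample = parse_ped_line("FAM1 IND1 0 0 1 2 G G A C")
print(marker)
print(sample.individual_id, sample.phenotype, sample.genotypes)
```

The order of allele pairs in each ".ped" line follows the order of markers in the ".map" file, which is why the two files are always used together.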
In conclusion, the advancements in genomic sciences and GWAS studies, powered by machine learning and AI-based solutions, have revolutionized the field of medicine and have led to significant breakthroughs in disease diagnosis, prevention, and treatment.
At Predera, we believe that machine learning and AI-based solutions are the key to unlocking the full potential of genomic sciences and GWAS studies. These technologies are providing us with new tools to analyze and interpret vast amounts of genetic data, leading to exciting discoveries in the field of medicine. We are committed to leveraging these technologies to continue advancing the understanding of the human genome and to developing new treatments and cures for diseases.
Stay tuned for Part 2 of our blog, where we will delve deeper into the exciting advancements in genomic sciences and GWAS!