Allele frequency is a frequency of given variant in a population. The Allele Frequency spectrum (AFS, or Site Frequency Spectrum) is a histogram that describes the number of genetic variants with a given frequency in a set of sequences. It can be used to estimate the dynamics of the population (expansion, bottleneck etc.).
In this part of the tutorial, we are going to learn how to analyze and show the differences in allele frequencies between Africans and Americans individuals in the HPRC dataset. By the end of this tutorial, we should have a graph that shows how these gene frequencies vary between the two groups, e.g. which population is enriched for rare variation, giving us a concise picture of genetic diversity betwee these two populations.
We will learn how to:
plink2
1_afs
within the mytutorial
foldercd mytutorial
cd 1_afs
Calculate allele frequencies directly from vcf, only for the SNPs, for all samples.
plink2 --vcf /home/genomics/workshop_materials/population_genomics/chr15.pan.fa.a2fb268.4030258.6a1ecc2.smooth.reliable.vcf.gz \
--freq \
--allow-extra-chr \
--vcf-half-call m \
--snps-only \
--out chr15
## --vcf loads a genotype VCF file
## --allow-extra-chr force to accept chromosome code 'chm13#chr15'
## --vcf-half-call specify how GT half-call should be processed. See tutorial section 0 for more info
## --snps-only selects only loci that are single nucleotide polymorphisms
## --out gives a human-readable name
This command will produce these files:
-rw-rw-r-- 1 enza enza 33492352 gen 14 11:28 chr15.afreq
-rw-rw-r-- 1 enza enza 923 gen 14 11:28 chr15.log
Open the .log
file and read it to check that all went well.
Open the .afreq
file and look at it. Check the info on the content of the columns and get familiar with it:
cat chr15.afreq | column -t | less -S
Plot the counts of the alternate allele to obtain the allele frequency spectrum from the alternate allele counts:
Rscript ../../rscripts/1.plotafsCOUNTS.R chr15.afreq HPRC-allele-counts
## first argument is the file with the allele counts
## second argument is the title of the graph
Open the file with the plot:
Next we want to calculate allele frequencies specifically for the African and American populations. This time we will use the plink2
fileset instead of the vcf
.
The crucial first step involves identifying the samples that belong to these specific populations. This information is in metadata files that we need to reference using the --keep
option:
plink2 --pfile ../data/chr15 \
--keep ../../metadata/afr.onlyid \
--freq \
--allow-extra-chr \
--snps-only \
--out chr15.afr
plink2 --pfile ../data/chr15 \
--keep ../../metadata/amr.onlyid \
--freq \
--allow-extra-chr \
--snps-only \
--out chr15.amr
## --pfile specifies which plink2 file set to use, you only need to specify the file prefix (e.g. chr15)
## --keep reference to the file containing the individuals to filter for (i.e. keep only individuals listed in the file)
## Note that we are filtering for individuals from African (afr) and the Americas (amr).
This will produce:
-rw-rw-r-- 1 enza enza 32483032 gen 14 11:38 chr15.afr.afreq
-rw-rw-r-- 1 enza enza 863 gen 14 11:38 chr15.afr.log
-rw-rw-r-- 1 enza enza 30547659 gen 14 11:38 chr15.amr.afreq
-rw-rw-r-- 1 enza enza 863 gen 14 11:38 chr15.amr.log
As usual, open the log and check for error messages if any.
Open the .afreq
files and look at the content
To produce the AFS we should bin the allele frequencies in classes. In plink2, the --freq
option includes the refbin
(bin by the reference allele) and alt1bin
(bin by the alternative allele) modifiers for generating histogram counts during allele frequency calculations. To use these modifiers, you must specify the bin intervals either directly in the command line or by referencing a file, and using the refbin-file
and alt1bin-file
modifiers.
We can produce the bin info file with a simple shell command that write a sequence form 0.1 to 1 in intervals of 0.05 and store it in a file called mybins.txt
:
LC_NUMERIC="C" seq -s " " 0.05 0.05 1.0 > mybins.txt
Look at the mybins.txt
file so you better understand what the code above did:
cat mybins.txt
Calculate the counts for the AFS based on alternate allele frequencies:
plink2 --pfile ../data/chr15 \
--keep ../../metadata/afr.onlyid \
--freq alt1bins-file='mybins.txt' \
--allow-extra-chr --snps-only \
--out chr15.afr
plink2 --pfile ../data/chr15 \
--keep ../../metadata/amr.onlyid \
--freq alt1bins-file='mybins.txt' \
--allow-extra-chr \
--snps-only \
--out chr15.amr
## --freq alt1bins-file='mybins.txt' specify to bin the frequency as described in the file mybins.txt
These commands will produce:
-rw-rw-r-- 1 enza enza 213 gen 16 09:43 chr15.afr.afreq.alt1.bins
-rw-rw-r-- 1 enza enza 212 gen 16 09:43 chr15.amr.afreq.alt1.bins
Look at the files chr15.afr.afreq.alt1.bins
and chr15.afr.afreq.alt1.bins
. To plot the AFS, we will use R
. You can load the two files in Rstudio
and write your own code (using ggplot - from what you learned earlier in the week!) or use the pre-prepared R script 1.plotafs.R
in the ../../rscripts
folder. Note that you can use the Rscript
command to execute R scripts directly from the command line without the need for an interactive R environment.
Rscript ../../rscripts/1.plotafsFREQ.R chr15.afr.afreq.alt1.bins chr15.amr.afreq.alt1.bins afr-amr-ALT
# first argument is the allele frequency in the first population
# second argument is the allele frequency in the second population
# third argument is the title of the graph, I chose afr-amr to remember the name of the populations and ALT
# to remember that we are considering the Alternate allele frequency. Do not use spaces if multiple words!
Look at the png
file just produced. To view the png image, you can use Guacamole desktop or copy the file to your laptop using scp
command.