popgen-evomics

Back to sections

Population differentiation using PCA

(or the lack of it on the short arms of acrocentric chromosomes)

Principal Component Analysis (PCA) is a statistical technique used to reduce the dimensionality of large datasets by transforming them into a set of linearly uncorrelated variables called principal components. In genetics, PCA is used to identify and visualize patterns of genetic variation within a population, helping to understand genetic differences and similarities among individuals or groups.

acro pca

This image from PMID: 37165242 shows PCA on the entire chromosome 15 compared with PCA of long and short arms. Both the entire chromosome and the long arm analyses show a degree of separation between Africans and Americans, while the population structure disappear on the short arm. This is because short arms of the human acrocentric chromosomes recombine PMID: 37165242

In this tutorial, we will replicate the findings form the figure above. We will perform PCA analysis of the HPRC dataset. By the end of this tutorial, we should have a graphs that show us how individuals relate to others based on their genetic similarity/diversity.

We will learn

cd mytutorial
cd 3_pca

Before starting: f o r m a t t i n g

Understanding and dealing with formatting issues sometimes take more time than runnig the analyses! The VCF we are using contains repeated variant IDs. As en example if you grep for >77497>77500 you will find three lines corresponding to three different genomic location for which the variant ID is the same.

chm13#chr15     68355   >77497>77500    A       T
chm13#chr15     74647   >77497>77500    A       T
chm13#chr15     80963   >77497>77500    T       A

This presents a challenge specific to pangenomes as the VCF (Variant Call Format) was not originally designed to fully represent the comprehensive information that pangenome graphs contain. In this case >77497>77500 represents a chunk of the genome containing multiple nested variants.

Some of the plink (and other software) functions require unique IDs, therefore with the --set-all-var-ids we will convert variant IDs in the format that make them unique. We thus will recode the variant ID before proceding to pca analysis, so that they will look like this:

chm13#chr15     68355   chm13#chr15:68355    A       T
chm13#chr15     74647   chm13#chr15:74647    A       T
chm13#chr15     80963   chm13#chr15:80963    T       A

To do this we will use plink and produce a new vcf using the option --export and applying the --set-all-var-ids:

plink2 --pfile ../data/chr15 \
       --allow-extra-chr \
       --snps-only \
       --set-all-var-ids @:#$r$a  \
       --export vcf \
       --out chr15uniqID


## --set-all-var-ids <template string>  deal with the annoying problem of having repeated variant ID in the VCF file (see explanation below this box). 
## Look at the [manual](https://www.cog-genomics.org/plink/2.0/data#set_all_var_ids) for a full explanation or at the box below for a quick one. 

## --out  name of the new vcf with the variant ID changed to be unique note that we do not need to specify file extension 

Open the chr15uniqID.vcf and look at the new variant ID. Next step is to gzip the vcf and produce the vcf index:

bgzip chr15uniqID.vcf 
tabix -p vcf  chr15uniqID.vcf.gz

(a) PCA for the entire chromosome

1. Calculate allele frequencies:

Allele frequncies where the variant ID will be corrected:

plink2 --vcf chr15uniqID.vcf.gz \
       --allow-extra-chr \
       --vcf-half-call m \
       --snps-only \
       --freq \
       --out chr15uniqID 

This will produce:

-rw-rw-r-- 1 enza enza 36451525 gen 15 15:30 chr15uniqID.afreq

2. Remove GRC-hg38

The sample named grch38 correspond to the sequence of GRC-hg38. Because of the different sequencing methods adopted, grch38 does not contain as much information as the other samples, therefore we will exclude it from the analysis. To do so we create a file that contain a list of individuals to be removed. This list will actually contain only grch38 and therefore be of length 1:

echo grch38 > samplestoremove.list

Of course you are welcome to run the pca including grch38. If so, what do you expect?

3. PCA

Finally! We have all the ingredient to run a pca: a vcf with unique variant IDs, a corresponding file of allele frequencies, a file with samples to be removed form the analysis:

plink2 --vcf chr15uniqID.vcf.gz \
       --allow-extra-chr \
       --vcf-half-call m \
       --snps-only \
       --remove samplestoremove.list \
       --read-freq chr15uniqID.afreq \
       --pca \
       --out all 

## --vcf loads a genotype VCF file
## --allow-extra-chr  force to accept chromosome code 'chm13#chr15'
## --vcf-half-call specify how GT half-call should be processed. See tutorial section 0 for more info
## --snps-only  selects only loci that are single nucleotide polymorphisms 
## --read-freq refers to the allele frequency file 
## --pca performs the pca; as default 10 components are extracted 
## --out gives a human-readable name 

This will produce:

-rw-rw-r-- 1 enza enza       79 gen 15 18:08 all.eigenval
-rw-rw-r-- 1 enza enza     4866 gen 15 18:08 all.eigenvec
-rw-rw-r-- 1 enza enza     1395 gen 15 18:08 all.log

From the plink manual: .eigenvec file is a text file with a header line and between 1+V and 3+V columns per sample, where V is the number of requested principal components. The first columns contain the sample ID, and the rest are principal component weights in the same order as the .eigenval values (with column headers ‘PC1’, ‘PC2’, …)

Explained variance

We first want to understand how much of the total variability in the dataset is accounted for by each principal component calculating the explained variance for each component. It quantifies how much of the total variability in a dataset is captured by each principal component.

To calculate the percent of explained varinace, for each principal component, divide its eigenvalue by the total sum of all eigenvalues. This gives the proportion of the variance that is explained by that particular principal component. To express this as a percentage, multiply the proportion by 100. This calculattion and its graphical representation is coded in rscripts/3screeplot.R

Rscript ../../rscripts/3.screeplot.R all.eigenval 

This should result in this image: scree

Scatter Plot

Next we want to make a scatter plot of the first two components:

Rscript  ../../rscripts/3.plotPCA.R all.eigenvec  ../../metadata/hprc.metadata

This should result in this image: pca

(b) PCA for the p-arm and the q-arm

Now try by yourself to implement pca only for the p-arm or the q-arm of the chromosome to see if there is a change. From this list of centromeres coordinates we can learn that for chr 15 the centromere is approximately between 15,412,039 bp and 17,709,803 bp.

Our vcf spans variants from position 410 to 99,753,074:


zcat chr15uniqID.vcf.gz | grep -v '##' | cut -f1,2  | head -2 
#CHROM  POS
chm13#chr15     410

zcat chr15uniqID.vcf.gz | grep -v '##' | cut -f1,2  | tail  -1 
chm13#chr15     99753074

icon How would you do it?

We can make two sub-vcfs each containing only markers from the q-arm or the q-arm using bcftools:

bcftools view -r chm13#chr15:17709803-99753074 -O z  -o q-arm.vcf.gz chr15uniqID.vcf.gz 

# -r specifies the genomic regioon 
# -O z produced a gzipped file 
# -o gives the output name 
# chr15uniqID.vcf.gz is teh starting vcf 

After producing the vcf we produce its index using tabix:

tabix -p vcf q-arm.vcf.gz

Similarly for the p-arm:

bcftools view -r chm13#chr15:0-15412039 -O z  -o p-arm.vcf.gz chr15uniqID.vcf.gz 
tabix -p vcf p-arm.vcf.gz

// Or you can use this already-made vcf for the p-arm: // shell // /home/genomics/workshop_materials/population_genomics/chr15.confident.SNPs.pArm.vcf.gz //

Now repeat the PCA with plink:

 
 plink2 --vcf q-arm.vcf.gz\
        --allow-extra-chr \
        --vcf-half-call m  \
        --snps-only \
        --remove samplestoremove.list \
        --read-freq chr15uniqID.afreq \
        --pca \
        --out q-arm

plink2 --vcf p-arm.vcf.gz \
        --allow-extra-chr \
        --vcf-half-call m \
        --snps-only \
        --rm-dup exclude-mismatch \
        --remove samplestoremove.list \
        --read-freq chr15uniqID.afreq \
        --mind 0.7 \
        --pca \
        --out p-arm

And the plotting:

Rscript  ../../rscripts/3.plotPCA.R p-arm.eigenvec  ../../metadata/hprc.metadata

Rscript  ../../rscripts/3.plotPCA.R q-arm.eigenvec  ../../metadata/hprc.metadata

icon Do you see any difference among p-arm, q-arm, and entire chromosome?

Back to sections