KLFDAPC for detection of natural selection

KLFDAPC has multiple advantages over the linear inference and linear discriminant analysis. The population structure inferred using KLFDAPC rectifies the limitation of linear population structure. Thereby, KLFDAPC could be helpful for correcting the genetic gradients (population stratification) in many aspects of population genetic studies, such as correction for geographic genetic structure in genome scan for detecting loci involved in adaptation, and in GWAS for identifying the signatures related to diseases or complex traits. We demonstrate an example using KLFDAPC to detect signatures related to local adaptation. In what follows, we illustrate the step-by-step implementation of genome scan using KLFDAPC.

In this vignette, we use a common implementation similar to PCA-based genome scan that reported by Patterson, N.,et al. 2006 and Luu, K., et al. 2017. However, in this implementation, the loci are weighted by correlation coefficients. We recommended users using more sophisticated implementation with high performance, such as DeepGenomeScan (Qin,X. et al. 2020).

Install packages and dependences

if (!requireNamespace("BiocManager", quietly=TRUE))
    install.packages("BiocManager")

if (!requireNamespace("SNPRelate", quietly=TRUE))
BiocManager::install("SNPRelate")

if (!requireNamespace("KLFDAPC", quietly=TRUE))
devtools::install_github("xinghuq/KLFDAPC")

if (!requireNamespace("DA", quietly=TRUE))
devtools::install_github("xinghuq/DA")

Our package uses the high performance parallel computation processed with SNPRelate (Zheng et al., 2012). It is written by C++, and the data is stored in an efficient format, snpGDS format.

Preparing data

In this vignette, We use the human HapMap data (https://www.genome.gov/10001688/international-hapmap-project) to show how to detect loci involved in adaptation using KLFDAPC.

Data preparation, we use the data that have already in GDS format stored in the library.

# Open the GDS file
genofile <- snpgdsOpen(snpgdsExampleFileName())
# Get sample id
sample.id <- read.gdsn(index.gdsn(genofile, "sample.id"))
# Get population information

pop_code <- read.gdsn(index.gdsn(genofile, "sample.annot/pop.group"))

# assume the order of sample IDs is as the same as population codes

pop_code <- read.gdsn(index.gdsn(genofile, path="sample.annot/pop.group"))
pop_code=factor(pop_code,levels=unique(pop_code))
table(pop_code)

set.seed(1000)

# fliter LD thresholds for sensitivity analysis
snpset <- snpgdsLDpruning(genofile, ld.threshold=0.2)

# Get all selected snp id
snpset.id <- unlist(unname(snpset))

head(snpset.id)

snpgdsClose(genofile)
# pop_code
# YRI CEU HCB JPT 
#  93  92  47  47 
# SNP pruning based on LD:
# Excluding 365 SNPs on non-autosomes
# Excluding 1 SNP (monomorphic: TRUE, MAF: NaN, missing rate: NaN)
# Working space: 279 samples, 8,722 SNPs
#     using 1 (CPU) core
#     sliding window: 500,000 basepairs, Inf SNPs
#     |LD| threshold: 0.2
#     method: composite
# Chromosome 1: 76.12%, 545/716
# Chromosome 2: 72.78%, 540/742
# Chromosome 3: 74.71%, 455/609
# Chromosome 4: 73.49%, 413/562
# Chromosome 5: 76.86%, 435/566
# Chromosome 6: 75.75%, 428/565
# Chromosome 7: 75.42%, 356/472
# Chromosome 8: 71.11%, 347/488
# Chromosome 9: 77.88%, 324/416
# Chromosome 10: 74.12%, 358/483
# Chromosome 11: 77.85%, 348/447
# Chromosome 12: 76.81%, 328/427
# Chromosome 13: 76.16%, 262/344
# Chromosome 14: 76.60%, 216/282
# Chromosome 15: 76.34%, 200/262
# Chromosome 16: 72.66%, 202/278
# Chromosome 17: 73.91%, 153/207
# Chromosome 18: 73.68%, 196/266
# Chromosome 19: 85.00%, 102/120
# Chromosome 20: 71.62%, 164/229
# Chromosome 21: 76.98%, 97/126
# Chromosome 22: 75.86%, 88/116
# 6,557 markers are selected in total.
# [1]  1  2  4  5  7 10

Doing KLFDAPC on genotype data with a Gaussian kernel

We use a Gaussian kernel with sigma =5 to infer the population structure. After we obtain the genetic structure, we then use it to do genome scan.

Visualize the population structure

Project the genetic features onto a plane.

plot(pop_str$RD1, pop_str$RD2, col=as.integer(pop_str$pop), xlab="RD 1", ylab="RD 2")
legend("topright", legend=levels(pop_str$pop), pch="o", col=1:nlevels(pop_str$pop))
&nbsp;

 

Plot manhattan plot

###plot manhattan plot 
 plot(-log10(klfdapcscanqvalue$q.values$q.values), ylim=c(0,4), xlab="", ylab="-log(q-value)",col=chr, pch="+")
   abline(h=2, col="blue")
&nbsp;

 

Reference

Patterson, N., Price, A. L., & Reich, D. (2006). Population structure and eigenanalysis. PLoS genet, 2(12), e190.

Luu, K., Bazin, E., & Blum, M. G. (2017). pcadapt: an R package to perform genome scans for selection based on principal component analysis. Molecular ecology resources, 17(1), 67-77.

A High-performance Computing Toolset for Relatedness and Principal Component Analysis of SNP Data. Xiuwen Zheng; David Levine; Jess Shen; Stephanie M. Gogarten; Cathy Laurie; Bruce S. Weir. Bioinformatics 2012; doi: 10.1093/bioinformatics/bts606.

R. A. Maronna and R. H. Zamar (2002) Robust estimates of location and dispersion of high-dimensional datasets. Technometrics 44 (4), 307–317.

Wang, J., Zamar, R., Marazzi, A., Yohai, V., Salibian-Barrera, M., Maronna, R., … & Maechler, M. (2017). robust: Port of the S+“Robust Library”. R package version 0.4-18. Available online at: https://CRAN. R-project. org/package= robust.

Qin, X., Chiang, C. W., & Gaggiotti, O. E. (2022). KLFDAPC: a supervised machine learning approach for spatial genetic structure analysis. Briefings in Bioinformatics, 23(4), bbac202.

Zheng X, Levine D, Shen J, Gogarten S, Laurie C, Weir B (2012). A High-performance Computing Toolset for Relatedness and Principal Component Analysis of SNP Data.” Bioinformatics, 28(24), 3326-3328. doi: 10.1093/bioinformatics/bts606.