vignettes/Genome_scan_KLFDAPC.Rmd
Genome_scan_KLFDAPC.Rmd
KLFDAPC has multiple advantages over the linear inference and linear discriminant analysis. The population structure inferred using KLFDAPC rectifies the limitation of linear population structure. Thereby, KLFDAPC could be helpful for correcting the genetic gradients (population stratification) in many aspects of population genetic studies, such as correction for geographic genetic structure in genome scan for detecting loci involved in adaptation, and in GWAS for identifying the signatures related to diseases or complex traits. We demonstrate an example using KLFDAPC to detect signatures related to local adaptation. In what follows, we illustrate the step-by-step implementation of genome scan using KLFDAPC.
In this vignette, we use a common implementation similar to PCA-based genome scan that reported by Patterson, N.,et al. 2006 and Luu, K., et al. 2017. However, in this implementation, the loci are weighted by correlation coefficients. We recommended users using more sophisticated implementation with high performance, such as DeepGenomeScan (Qin,X. et al. 2020).
Install packages and dependences
if (!requireNamespace("BiocManager", quietly=TRUE)) install.packages("BiocManager") if (!requireNamespace("SNPRelate", quietly=TRUE)) BiocManager::install("SNPRelate") if (!requireNamespace("KLFDAPC", quietly=TRUE)) devtools::install_github("xinghuq/KLFDAPC") if (!requireNamespace("DA", quietly=TRUE)) devtools::install_github("xinghuq/DA")
Our package uses the high performance parallel computation processed with SNPRelate (Zheng et al., 2012). It is written by C++, and the data is stored in an efficient format, snpGDS format.
In this vignette, We use the human HapMap data (https://www.genome.gov/10001688/international-hapmap-project) to show how to detect loci involved in adaptation using KLFDAPC.
Data preparation, we use the data that have already in GDS format stored in the library.
# Open the GDS file genofile <- snpgdsOpen(snpgdsExampleFileName()) # Get sample id sample.id <- read.gdsn(index.gdsn(genofile, "sample.id")) # Get population information pop_code <- read.gdsn(index.gdsn(genofile, "sample.annot/pop.group")) # assume the order of sample IDs is as the same as population codes pop_code <- read.gdsn(index.gdsn(genofile, path="sample.annot/pop.group")) pop_code=factor(pop_code,levels=unique(pop_code)) table(pop_code) set.seed(1000) # fliter LD thresholds for sensitivity analysis snpset <- snpgdsLDpruning(genofile, ld.threshold=0.2) # Get all selected snp id snpset.id <- unlist(unname(snpset)) head(snpset.id) snpgdsClose(genofile) # pop_code # YRI CEU HCB JPT # 93 92 47 47 # SNP pruning based on LD: # Excluding 365 SNPs on non-autosomes # Excluding 1 SNP (monomorphic: TRUE, MAF: NaN, missing rate: NaN) # Working space: 279 samples, 8,722 SNPs # using 1 (CPU) core # sliding window: 500,000 basepairs, Inf SNPs # |LD| threshold: 0.2 # method: composite # Chromosome 1: 76.12%, 545/716 # Chromosome 2: 72.78%, 540/742 # Chromosome 3: 74.71%, 455/609 # Chromosome 4: 73.49%, 413/562 # Chromosome 5: 76.86%, 435/566 # Chromosome 6: 75.75%, 428/565 # Chromosome 7: 75.42%, 356/472 # Chromosome 8: 71.11%, 347/488 # Chromosome 9: 77.88%, 324/416 # Chromosome 10: 74.12%, 358/483 # Chromosome 11: 77.85%, 348/447 # Chromosome 12: 76.81%, 328/427 # Chromosome 13: 76.16%, 262/344 # Chromosome 14: 76.60%, 216/282 # Chromosome 15: 76.34%, 200/262 # Chromosome 16: 72.66%, 202/278 # Chromosome 17: 73.91%, 153/207 # Chromosome 18: 73.68%, 196/266 # Chromosome 19: 85.00%, 102/120 # Chromosome 20: 71.62%, 164/229 # Chromosome 21: 76.98%, 97/126 # Chromosome 22: 75.86%, 88/116 # 6,557 markers are selected in total. # [1] 1 2 4 5 7 10
We use a Gaussian kernel with sigma =5 to infer the population structure. After we obtain the genetic structure, we then use it to do genome scan.
showfile.gds(closeall=TRUE)
klfdapcfile <- KLFDAPC(snpgdsExampleFileName(), y=pop_code, snp.id=snpset.id,n.pc = 3,kernel=kernlab::rbfdot(sigma = 5), r=5, num.thread=2)
# Warning: package 'klaR' was built under R version 3.6.3
# Make a data.frame
pop_str <- data.frame(sample.id = klfdapcfile$PCA$sample.id,
pop = factor(pop_code)[match(klfdapcfile$PCA$sample.id, sample.id)],
RD1 = klfdapcfile$KLFDAPC$Z[,1], # the first reduced feature
RD2 = klfdapcfile$KLFDAPC$Z[,2], # the second reduced feature
stringsAsFactors = FALSE)
head(pop_str)
rownames(pop_str)=pop_str$sample.id
# Principal Component Analysis (PCA) on genotypes:
# Excluding 2,531 SNPs (non-autosomes or non-selection)
# Excluding 0 SNP (monomorphic: TRUE, MAF: NaN, missing rate: NaN)
# Working space: 279 samples, 6,557 SNPs
# using 2 (CPU) cores
# PCA: the sum of all selected genotypes (0,1,2) = 1828045
# CPU capabilities: Double-Precision SSE2
# Thu Oct 29 00:40:50 2020 (internal increment: 1276)
#
[..................................................] 0%, ETC: ---
[==================================================] 100%, completed, 0s
# Thu Oct 29 00:40:50 2020 Begin (eigenvalues and eigenvectors)
# Thu Oct 29 00:40:50 2020 Done.
# [1] "Doing KLFDAPC"
# sample.id pop RD1 RD2
# 1 NA19152 YRI -53.71177 -19.85460
# 2 NA19139 YRI -55.41972 -19.32379
# 3 NA18912 YRI -49.86350 -22.98661
# 4 NA19160 YRI -58.62539 -22.75500
# 5 NA07034 CEU 79.90126 73.59765
# 6 NA07055 CEU 82.78208 75.75534
Project the genetic features onto a plane.
plot(pop_str$RD1, pop_str$RD2, col=as.integer(pop_str$pop), xlab="RD 1", ylab="RD 2") legend("topright", legend=levels(pop_str$pop), pch="o", col=1:nlevels(pop_str$pop))
After we choose the genetic features, we then use the genetic features to do genome scan with KLFDAPT.
?KLFDAPT to see the documentation. This genome scan is handeled with a C++ function, which is faster than other implementation. Users can also set the parallel process by changing the paramters of num.threshold.
showfile.gds(closeall=TRUE)
# Get chromosome index
genofile <- snpgdsOpen(snpgdsExampleFileName())
chr <- read.gdsn(index.gdsn(genofile, "snp.chromosome"))
klfdapcmat=as.matrix(pop_str[,-c(1:2)])
### Doing genome scan
### using the correlation to estimate the contribution of SNPs to the pop str (which is the same as PCA does )
klfdapcscanqvalue <- KLFDAPT(klfdapcmat, genofile, n.RD=1:2)
# Warning: package 'robust' was built under R version 3.6.3
# Warning: package 'fit.models' was built under R version 3.6.3
# SNP Correlation:
# Working space: 279 samples, 9088 SNPs
# using 1 (CPU) core
# using the top 2 eigenvectors
# Correlation: the sum of all selected genotypes (0,1,2) = 2553065
# Thu Oct 29 00:40:51 2020 (internal increment: 10216)
#
[..................................................] 0%, ETC: ---
[==================================================] 100%, completed, 0s
# Thu Oct 29 00:40:51 2020 Done.
###plot manhattan plot plot(-log10(klfdapcscanqvalue$q.values$q.values), ylim=c(0,4), xlab="", ylab="-log(q-value)",col=chr, pch="+") abline(h=2, col="blue")
Patterson, N., Price, A. L., & Reich, D. (2006). Population structure and eigenanalysis. PLoS genet, 2(12), e190.
Luu, K., Bazin, E., & Blum, M. G. (2017). pcadapt: an R package to perform genome scans for selection based on principal component analysis. Molecular ecology resources, 17(1), 67-77.
A High-performance Computing Toolset for Relatedness and Principal Component Analysis of SNP Data. Xiuwen Zheng; David Levine; Jess Shen; Stephanie M. Gogarten; Cathy Laurie; Bruce S. Weir. Bioinformatics 2012; doi: 10.1093/bioinformatics/bts606.
R. A. Maronna and R. H. Zamar (2002) Robust estimates of location and dispersion of high-dimensional datasets. Technometrics 44 (4), 307–317.
Wang, J., Zamar, R., Marazzi, A., Yohai, V., Salibian-Barrera, M., Maronna, R., … & Maechler, M. (2017). robust: Port of the S+“Robust Library”. R package version 0.4-18. Available online at: https://CRAN. R-project. org/package= robust.
Qin, X., Chiang, C. W., & Gaggiotti, O. E. (2022). KLFDAPC: a supervised machine learning approach for spatial genetic structure analysis. Briefings in Bioinformatics, 23(4), bbac202.
Zheng X, Levine D, Shen J, Gogarten S, Laurie C, Weir B (2012). A High-performance Computing Toolset for Relatedness and Principal Component Analysis of SNP Data.” Bioinformatics, 28(24), 3326-3328. doi: 10.1093/bioinformatics/bts606.