KLFDAPT implements genome scan based on KLFDAPC. The implementation is similar with PCA-based genome scan that reported by Patterson, N.,et al. 2006 and Luu, K., et al. 2017. The loci are weighted by the correlations between genetic variation and the genetic structure (first K reduced features).

KLFDAPT(klfdapc, gdsfile, estimation="auto", snp.id = NULL, n.RD = NULL, num.thread = 1L, with.id = TRUE, outgds = NULL, verbose = TRUE)

Arguments

klfdapc

KLFDAPC matrix. The results returned from KLFDAPC analysis.

gdsfile

The genotype file, it is an object of class SNPGDSFileClass, a SNP GDS file

estimation

a character string specifying the robust estimator to be used. The choices are: "mcd" for the Fast MCD algorithm of Rousseeuw and Van Driessen, "weighted" for the Reweighted MCD, "donostah" for the Donoho-Stahel projection based estimator, "M" for the constrained M estimator provided by Rocke, "pairwiseQC" for the orthogonalized quadrant correlation pairwise estimator, and "pairwiseGK" for the Orthogonalized Gnanadesikan-Kettenring pairwise estimator. The default "auto" selects from "donostah", "mcd", and "pairwiseQC" with the goal of producing a good estimate in a reasonable amount of time.

snp.id

a vector of snp id indicating the selected SNP id, if NULL, all SNPs are used

n.RD

a vector of integers indicating which reduced features to be used in KLFDAPC

num.thread

the number of (CPU) cores used; if NA, detect the number of cores automatically

with.id

if TRUE, the returned value with sample.id and sample.id

outgds

NULL or a character of file name for exporting correlations to a GDS file, see details

verbose

if TRUE, show information

Details

For more details, please read our vignette (https://github.com/xinghuq/KLFDAPC).

Value

cor

The correlation coefficients between the genotypes and the population genetic structure

q.values

The p.values and q.values for the correlation coefficients

%% ...

References

Patterson, N., Price, A. L., & Reich, D. (2006). Population structure and eigenanalysis. PLoS genet, 2(12), e190.

Luu, K., Bazin, E., & Blum, M. G. (2017). pcadapt: an R package to perform genome scans for selection based on principal component analysis. Molecular ecology resources, 17(1), 67-77.

A High-performance Computing Toolset for Relatedness and Principal Component Analysis of SNP Data. Xiuwen Zheng; David Levine; Jess Shen; Stephanie M. Gogarten; Cathy Laurie; Bruce S. Weir. Bioinformatics 2012; doi: 10.1093/bioinformatics/bts606.

R. A. Maronna and R. H. Zamar (2002) Robust estimates of location and dispersion of high-dimensional datasets. Technometrics 44 (4), 307–317.

Wang, J., Zamar, R., Marazzi, A., Yohai, V., Salibian-Barrera, M., Maronna, R., ... & Maechler, M. (2017). robust: Port of the S+“Robust Library”. R package version 0.4-18. Available online at: https://CRAN. R-project. org/package= robust.

Author

qinxinghu@gmail.com

Note

See also

Examples

###Hap_map data ### Doing KLFDAPC, default kernel=kernlab::polydot(degree = 1, scale = 1, offset = 1) # Open the GDS file library(KLFDAPC) library(SNPRelate)
#> Loading required package: gdsfmt
#> SNPRelate -- supported by Streaming SIMD Extensions 2 (SSE2)
genofile <- snpgdsOpen(snpgdsExampleFileName()) # Get sample id sample.id <- read.gdsn(index.gdsn(genofile, "sample.id")) # Get population information pop_code <- read.gdsn(index.gdsn(genofile, "sample.annot/pop.group")) # assume the order of sample IDs is as the same as population codes pop_code <- read.gdsn(index.gdsn(genofile, path="sample.annot/pop.group")) pop_code=factor(pop_code,levels=unique(pop_code)) # fliter LD thresholds for sensitivity analysis snpset <- snpgdsLDpruning(genofile, ld.threshold=0.2)
#> SNP pruning based on LD: #> Excluding 365 SNPs on non-autosomes #> Excluding 1 SNP (monomorphic: TRUE, MAF: NaN, missing rate: NaN) #> Working space: 279 samples, 8,722 SNPs #> using 1 (CPU) core #> sliding window: 500,000 basepairs, Inf SNPs #> |LD| threshold: 0.2 #> method: composite #> Chromosome 1: 75.84%, 543/716 #> Chromosome 2: 72.78%, 540/742 #> Chromosome 3: 74.38%, 453/609 #> Chromosome 4: 73.49%, 413/562 #> Chromosome 5: 76.33%, 432/566 #> Chromosome 6: 75.40%, 426/565 #> Chromosome 7: 75.42%, 356/472 #> Chromosome 8: 71.31%, 348/488 #> Chromosome 9: 77.64%, 323/416 #> Chromosome 10: 74.33%, 359/483 #> Chromosome 11: 77.18%, 345/447 #> Chromosome 12: 76.35%, 326/427 #> Chromosome 13: 75.87%, 261/344 #> Chromosome 14: 76.60%, 216/282 #> Chromosome 15: 76.34%, 200/262 #> Chromosome 16: 73.38%, 204/278 #> Chromosome 17: 73.91%, 153/207 #> Chromosome 18: 74.44%, 198/266 #> Chromosome 19: 85.00%, 102/120 #> Chromosome 20: 71.62%, 164/229 #> Chromosome 21: 75.40%, 95/126 #> Chromosome 22: 77.59%, 90/116 #> 6,547 markers are selected in total.
# Get all selected snp id snpset.id <- unlist(unname(snpset)) snpgdsClose(genofile) klfdapcfile <- KLFDAPC(snpgdsExampleFileName(), y=pop_code, snp.id=snpset.id,n.pc = 3, kernel=kernlab::polydot(degree = 1, scale = 1, offset = 1), r=5,num.thread=2)
#> Principal Component Analysis (PCA) on genotypes: #> Excluding 2,541 SNPs (non-autosomes or non-selection) #> Excluding 0 SNP (monomorphic: TRUE, MAF: NaN, missing rate: NaN) #> Working space: 279 samples, 6,547 SNPs #> using 2 (CPU) cores #> PCA: the sum of all selected genotypes (0,1,2) = 1829478 #> CPU capabilities: Double-Precision SSE2 #> Thu Oct 29 00:40:43 2020 (internal increment: 1276) #> [..................................................] 0%, ETC: --- [==================================================] 100%, completed, 0s #> Thu Oct 29 00:40:43 2020 Begin (eigenvalues and eigenvectors) #> Thu Oct 29 00:40:44 2020 Done. #> [1] "Doing KLFDAPC"
#> Warning: NaNs produced
#> Loading required package: WMDB
#> Loading required package: DA
#> #> Attaching package: 'DA'
#> The following object is masked from 'package:KLFDAPC': #> #> KLFDA
#> Warning: no non-missing arguments to min; returning Inf
#> Loading required package: klaR
#> Warning: package 'klaR' was built under R version 3.6.3
#> Loading required package: MASS
#> Warning: collapsing to unique 'x' values
#> Warning: collapsing to unique 'x' values
#> Warning: collapsing to unique 'x' values
#> Warning: collapsing to unique 'x' values
#> Error in density.default(xx, ...): 'x' contains missing values
snpgdsClose(genofile)
#> Error in closefn.gds(gdsobj): The GDS file is closed or uninitialized.
# Make a data.frame pop_str <- data.frame(sample.id = klfdapcfile$PCA$sample.id, pop = factor(pop_code)[match(klfdapcfile$PCA$sample.id, sample.id)], RD1 = klfdapcfile$KLFDAPC$Z[,1], # the first reduced feature RD2 = klfdapcfile$KLFDAPC$Z[,2], # the second reduced feature stringsAsFactors = FALSE)
#> Error in data.frame(sample.id = klfdapcfile$PCA$sample.id, pop = factor(pop_code)[match(klfdapcfile$PCA$sample.id, sample.id)], RD1 = klfdapcfile$KLFDAPC$Z[, 1], RD2 = klfdapcfile$KLFDAPC$Z[, 2], stringsAsFactors = FALSE): object 'klfdapcfile' not found
head(pop_str)
#> Error in head(pop_str): object 'pop_str' not found
rownames(pop_str)=pop_str$sample.id
#> Error in eval(expr, envir, enclos): object 'pop_str' not found
genofile <- snpgdsOpen(snpgdsExampleFileName()) chr <- read.gdsn(index.gdsn(genofile, "snp.chromosome")) klfdapcmat=as.matrix(pop_str[,-c(1:2)])
#> Error in as.matrix(pop_str[, -c(1:2)]): object 'pop_str' not found
### using the correlation to estimate the contribution of SNPs to the pop str (which is the same as PCA does ) klfdapcscanqvalue <- KLFDAPT(klfdapcmat, genofile, n.RD=1:2)
#> Error: object 'klfdapcmat' not found
snpgdsClose(genofile) ###Mmanhattan plot plot(-log10(klfdapcscanqvalue$q.values$q.values), ylim=c(0,4), xlab="", ylab="-log(q-value)",col=chr, pch="+")
#> Error in plot(-log10(klfdapcscanqvalue$q.values$q.values), ylim = c(0, 4), xlab = "", ylab = "-log(q-value)", col = chr, pch = "+"): object 'klfdapcscanqvalue' not found
abline(h=2, col="blue")
#> Error in int_abline(a = a, b = b, h = h, v = v, untf = untf, ...): plot.new has not been called yet