LD pruning
Before running PCA, we use LD pruning to select a set of independent SNPs for analysis. We use the snpgdsLDpruning
in the SNPRelate package, which returns a list of snp IDs.
library(SNPRelate)
# read in GDS data
gds <- snpgdsOpen(gdsfile)
snpset <- snpgdsLDpruning(gds, method="corr", slide.max.bp=10e6,
ld.threshold=sqrt(0.1), verbose=FALSE)
pruned <- unlist(snpset, use.names=FALSE)
length(pruned)
## [1] 3826
head(pruned)
## [1] 6 7 15 17 22 31
snpgdsClose(gds)
Pairwise Measures of Ancestry Divergence
It is possible to identify a subset of mutually unrelated individuals in a sample based solely on pairwise measures of genetic relatedness (i.e. kinship coefficients). However, in order to obtain accurate population structure inference for the entire sample, it is important that the ancestries of all individuals in the sample are represented by at least one individual in this subset. In order to identify a mutually unrelated and ancestry representative subset of individuals, PC-AiR also utilizes measures of ancestry divergence. These measures are calculated using the KING-robust kinship coefficient estimator (Manichaikul et al., 2010), which provides systematically negative estimates for unrelated pairs of individuals with different ancestry. The number of negative pairwise estimates that an individual has provides information regarding how different that individual’s ancestry is from the rest of the sample, which helps to prioritize individuals that should be kept in the ancestry representative subset.
The relevant output from the KING software is two text files with the file extensions .kin0 and .kin. The kingToMatrix
function can be used to extract the kinship coefficients (which we use as divergence measures) from this output and create a matrix usable by the *[GENESIS](https://bioconductor.org/packages/3.15/GENESIS)*
functions.
# create matrix of KING estimates
library(GENESIS)
KINGmat <- kingToMatrix(
c(system.file("extdata", "MXL_ASW.kin0", package="GENESIS"),
system.file("extdata", "MXL_ASW.kin", package="GENESIS")),
estimator = "Kinship")
KINGmat[1:5,1:5]
## 5 x 5 Matrix of class "dsyMatrix"
## NA19625 NA19649 NA19650 NA19651 NA19652
## NA19625 0.5000 -0.0761 -0.0656 -0.0497 -0.0486
## NA19649 -0.0761 0.5000 0.2513 -0.0187 -0.0141
## NA19650 -0.0656 0.2513 0.5000 -0.0037 -0.0033
## NA19651 -0.0497 -0.0187 -0.0037 0.5000 0.0112
## NA19652 -0.0486 -0.0141 -0.0033 0.0112 0.5000
The column and row names of the matrix are the individual IDs, and each off-diagonal entry is the KING-robust estimate for the specified pair of individuals.
Alternative to running the KING software, the snpgdsIBDKING
function from the SNPRelate package can be used to calculate the KING-robust estimates directly from a GDS file. The ouput of this function contains a matrix of pairwise estimates, which can be used by the GENESIS functions.
Running PC-AiR
The PC-AiR algorithm requires pairwise measures of both kinship and ancestry divergence in order to partition the sample into an “unrelated subset” and a “related subset.” The kinship coefficient estimates are used to identify relatives, as only one individual from a set of relatives can be included in the “unrelated subset,” and the rest must be assigned to the “related subset.” The ancestry divergence measures calculated from KING-robust are used to help select which individual from a set of relatives has the most unique ancestry and should be given priority for inclusion in the “unrelated subset.”
The KING-robust estimates read in above are always used as measures of ancestry divergence for unrelated pairs of individuals, but they can also be used as measures of kinship for relatives (NOTE: they may be biased measures of kinship for admixed relatives with different ancestry). The pcair
function performs the PC-AiR analysis.
We use the GWASTools package to create the GenotypeData object needed by GENESIS.
library(GWASTools)
HapMap_geno <- GdsGenotypeReader(filename = gdsfile)
# create a GenotypeData class object
HapMap_genoData <- GenotypeData(HapMap_geno)
HapMap_genoData
## An object of class GenotypeData
## | data:
## File: /tmp/RtmpoUJbO4/Rinst18496c12e17b9c/GENESIS/extdata/HapMap_ASW_MXL_geno.gds (901.8K)
## + [ ] *
## |--+ sample.id { Int32,factor 173 ZIP(40.9%), 283B } *
## |--+ snp.id { Int32 20000 ZIP(34.6%), 27.1K }
## |--+ snp.position { Int32 20000 ZIP(34.6%), 27.1K }
## |--+ snp.chromosome { Int32 20000 ZIP(0.13%), 103B }
## \--+ genotype { Bit2 20000x173, 844.7K } *
## | SNP Annotation:
## NULL
## | Scan Annotation:
## NULL
# run PC-AiR on pruned SNPs
mypcair <- pcair(HapMap_genoData, kinobj = KINGmat, divobj = KINGmat,
snp.include = pruned)
## Principal Component Analysis (PCA) on genotypes:
## Excluding 16,174 SNPs (non-autosomes or non-selection)
## Excluding 0 SNP (monomorphic: TRUE, MAF: NaN, missing rate: NaN)
## # of samples: 97
## # of SNPs: 3,826
## using 1 thread
## # of principal components: 32
## PCA: the sum of all selected genotypes (0,1,2) = 185850
## CPU capabilities: Double-Precision SSE2
## Tue Mar 15 16:31:01 2022 (internal increment: 32048)
##
[..................................................] 0%, ETC: ---
[==================================================] 100%, completed, 0s
## Tue Mar 15 16:31:01 2022 Begin (eigenvalues and eigenvectors)
## Tue Mar 15 16:31:01 2022 Done.
## SNP Loading:
## # of samples: 97
## # of SNPs: 3,826
## using 1 thread
## using the top 32 eigenvectors
## SNP Loading: the sum of all selected genotypes (0,1,2) = 185850
## Tue Mar 15 16:31:01 2022 (internal increment: 65536)
##
[..................................................] 0%, ETC: ---
[==================================================] 100%, completed, 0s
## Tue Mar 15 16:31:01 2022 Done.
## Sample Loading:
## # of samples: 76
## # of SNPs: 3,826
## using 1 thread
## using the top 32 eigenvectors
## Sample Loading: the sum of all selected genotypes (0,1,2) = 144468
## Tue Mar 15 16:31:01 2022 (internal increment: 65536)
##
[..................................................] 0%, ETC: ---
[==================================================] 100%, completed, 0s
## Tue Mar 15 16:31:01 2022 Done.
genoData
is a GenotypeData
class object
kinobj
is a matrix of pairwise kinship coefficient estimates
divobj
is a matrix of pairwise measures of ancestry divergence (KING-robust estimates)
snp.include
is a vector of snp IDs
If better estimates of kinship coefficients are available, then the kinobj
input can be replaced with a similar matrix of these estimates. The divobj
input should always remain as the KING-robust estimates.
Reference Population Samples
As PCA is an unsupervised method, it is often difficult to identify what population structure each of the top PCs represents. In order to provide some understanding of the inferred structure, it is sometimes recommended to include reference population samples of known ancestry in the analysis. If the data set contains such reference population samples, we may want to make sure that these reference population samples are included in the “unrelated subset.” This can be accomplished by setting the input unrel.set
equal to a vector, IDs
, of the individual IDs for the reference population samples.
mypcair <- pcair(HapMap_genoData, kinobj = KINGmat, divobj = KINGmat,
snp.include = pruned, unrel.set = IDs)
This will force individuals specified with unrel.set
into the “unrelated subset,” move any of their relatives into the “related subset,” and then perform the PC-AiR partitioning algorithm on the remaining samples.
Partition a Sample without Running PCA
It may be of interest to partition a sample into an ancestry representative “unrelated subset” of individuals and a “related subset” of individuals without performing a PCA. The pcairPartition
function, which is called within the pcair
function, enables the user to do this.
part <- pcairPartition(kinobj = KINGmat, divobj = KINGmat)
The output contains two vectors that give the individual IDs for the “unrelated subset” and the “related subset.”
head(part$unrels); head(part$rels)
## [1] "NA19708" "NA19711" "NA19712" "NA19737" "NA19740" "NA19741"
## [1] "NA19686" "NA20346" "NA20345" "NA20347" "NA19664" "NA19677"
As with the pcair
function, the input unrel.set
can be used to specify certain individuals that must be part of the “unrelated subset.”
Output from PC-AiR
An object returned from the pcair
function has class pcair
. A summary
method for an object of class pcair
is provided to obtain a quick overview of the results.
summary(mypcair)
## Call:
## .pcair(gdsobj = gdsobj, kinobj = kinobj, divobj = divobj, kin.thresh = kin.thresh,
## div.thresh = div.thresh, unrel.set = unrel.set, sample.include = sample.include,
## snp.include = snp.include, num.cores = num.cores, verbose = verbose)
##
## PCA Method: PC-AiR
##
## Sample Size: 173
## Unrelated Set: 97 Samples
## Related Set: 76 Samples
##
## Kinship Threshold: 0.02209709
## Divergence Threshold: -0.02209709
##
## Principal Components Returned: 32
## Eigenvalues: 2.946 1.717 1.326 1.292 1.277 1.255 1.242 1.223 1.219 1.201 ...
## SNPs Used: 3826
The output provides the total sample size along with the number of individuals assigned to the unrelated and related subsets, as well as the threshold values used for determining which pairs of individuals were related or ancestrally divergent. The eigenvalues for the top PCs are also shown, which can assist in determining the number of PCs that reflect structure. The minor allele frequency (MAF) filter used for excluding SNPs is also specified, along with the total number of SNPs analyzed after this filtering. Details describing how to modify the analysis parameters and the available output of the function can be found with the command help(pcair)
.
Plotting PC-AiR PCs
The GENESIS package also provides a plot
method for an object of class pcair
to quickly visualize pairs of PCs. Each point in one of these PC pairs plots represents a sample individual. These plots help to visualize population structure in the sample and identify clusters of individuals with similar ancestry.
# plot top 2 PCs
plot(mypcair)
# plot PCs 3 and 4
plot(mypcair, vx = 3, vy = 4)
The default is to plot PC values as black dots and blue pluses for individuals in the “unrelated subset” and “related subsets” respectively. The plotting colors and characters, as well as other standard plotting parameters, can be changed with the standard input to the plot
function.