Population Genomics: Identifying Outlier Loci and Predicting Future Adaptation

In this tutorial we will be working with the Yosemite Toad, which is found in the central Sierra Nevada mountains of eastern California. The distribution of Yosemite Toads overlaps with two national parks, Yosemite and Kings Canyon. The genomic datasets in this tutorial come from the overlap between toads and the two parks: Yosemite (a large area of approximately 40 x 50 miles) and Kings Canyon (a smaller area of approximately 20 x 20 miles).

Our goal is to use a ddRADseq dataset with 1672 variable SNP loci across two national parks to detect loci responding to environmental selection. There are two broad categories of outlier tests: (1) those which model neutral genetic structure among populations or individuals, then identify SNPs which are outside of that distribution; and (2) tests which directly correlate environmental variables with SNP allele frequencies. Recent advancements have overcome high Type I and Type II error rates in both types of test, by more accurately accounting for “neutral” population structure in the data. We will use two tests identify candidate outlier loci, and then use forecasts of future climate data to predict the fate of Yosemite Toad adaptation near the end of the 21st century.

1. Set Up Shop: Install, Load, Prepare Data

Clean up your global environment first. Remove all R objects and start with a clean slate.

rm(list=ls())

Set your working directory and apply global settings. In this case we want to save our initial import settings and graphical parameters for later, when we change them.

setwd("~/Desktop/Tutorial/")
options(stringsAsFactors = F)
pars <- par()

Source the “behind-the-scenes” functions that are used for in this tutorial.

source("./Rscripts/624functions.R")

Unload currently loaded libraries to clear the memory and start fresh.

invisible(unloadPackages(.packages()))

Load all libraries used in this tutorial.

libraries <- c("rgdal", "raster", "gtools", "adespatial", "ade4", "adegraphics", "spdep", "maptools", "adegenet", "qvalue", "pcadapt", "gdm", "gradientForest", "vegan", "rangeBuilder", "rgeos", "assigner", "ggfortify")
sapply(libraries, function(x) { suppressMessages(require(x, character.only=T)) } )
##          rgdal         raster         gtools     adespatial           ade4 
##           TRUE           TRUE           TRUE           TRUE           TRUE 
##    adegraphics          spdep       maptools       adegenet         qvalue 
##           TRUE           TRUE           TRUE           TRUE           TRUE 
##        pcadapt            gdm gradientForest          vegan   rangeBuilder 
##           TRUE           TRUE           TRUE           TRUE           TRUE 
##          rgeos       assigner      ggfortify 
##           TRUE           TRUE           TRUE

Import all files needed for this tutorial. These include three shapefiles (the Anaxyrus canorus species range, Yosemite National Park, and Kings Canyon National Park), the lat/long of all points for which we have ddRADseq data, a list of populations in a specific order, a list of loci used, and the actual ddRADseq data in multiple formats.

canorus <- readOGR(dsn="./GIS", layer="canorus")
kica <- readOGR(dsn="./GIS", layer="Kingsbndry")
yose <- readOGR(dsn="./GIS", layer="YOSEbndry")
p <- read.table("./GIS/pops_all.txt", header=T)
poporder <- read.table("./Data/poporder.bayenv2.txt", header=T)[,1]
loci <- read.table("./Data/data_radsnp_allparks.loci", header=T)
data.file.3 <- "./Data/data_radsnp_allparks.ped"
data.file.4 <- "./Data/data_radsnp_allparks.stru"

2. Extract All Environmental Data

Now we will extract the environmental data using the coordinates of our ddRADseq sample locations. These include a DEM (elevation from SRTM), the 19 BioClim variables from a global climate model, and several BCM variables from a California climate model. (In the future it would be interesting to include other climate data, such as ENVIREM.) We will also include spatial predictors that simply describe the arrangement of sample locations in geographic space (MEMs - below). These are important to include in landscape genomic models, because they can account for otherwise unmodeled variables, either from isolation-by-distance (IBD), phylogeographic signal in separate lineages, or unsampled environmental variables.

In this tutorial, all rasters have already been clipped to the study area (Yosemite Toad species distribution), and the coordinate system reprojected to one shared system (lat/long).

First, extract elevation from the DEM.

elevation <- raster("./GIS/current_env/elevation.tif")
toad.env.data <- data.frame( pop=p[,1], long=p[,2], lat=p[,3], elevation=extract(elevation, p[,-1]) )
**Elevation for the study area.**

Elevation for the study area.

Next, extract the 19 BioClim variables. This dataset is modeled at 1 km resolution. Click the link for explanations of the variables, which are summarized for the years 1960-1990.

files1 <- list.files(path="./GIS/current_env", pattern='bio_[0-9]+\\.tif$')
files1 <- mixedsort(files1)
files1 <- paste0("./GIS/current_env/", files1)
r1 <- sapply(files1, function(x) { raster(x) } )
s1 <- stack(r1)
names(s1) <- gsub(".tif", "", names(s1))
names(s1) <- gsub("..GIS.current_env.", "", names(s1))
toad.env.data <- data.frame( toad.env.data, extract(s1, p[,-1]) )
**BioClim variables for the study area. The first 12 out of 19 are plotted to save space.**

BioClim variables for the study area. The first 12 out of 19 are plotted to save space.

Extract California Basin Characterization Model (BCM) variables. This dataset is modeled at 270 m resolution. The BCM variables included for this tutorial are Climatic Water Deficit (CWD), Runoff (RUN), Recharge (RCH), and April 1 Snowpack (APRPCK), summarized across the years 1981-2010, and including average and standard deviation across years. For the first three variables (not including snowpack), we also have data summarized across years for just the summer months (May - August).

files2 <- c(list.files(path="./GIS/current_env", pattern="ave\\.tif$"),
            list.files(path="./GIS/current_env", pattern="std\\.tif$"))
files2 <- paste0("./GIS/current_env/", files2)
r2 <- sapply(files2, function(x) { raster(x) } )
s2 <- stack(r2)
names(s2) <- gsub(".tif", "", names(s2))
names(s2) <- gsub("..GIS.current_env.", "", names(s2))
toad.env.data <- data.frame( toad.env.data, extract(s2, p[,-1] ) )
**BCM variables for the study area.**

BCM variables for the study area.

Resample all rasters to one stack (for prediction/mapping later on). Although we will use the original resolution for finding outliers with the program bayenv2 (below), it will be easier to use one common resolution for predictions and mapping, in this case 1 km resolution.

all.r <- c(elevation, r1, r2)
all.r.res <- list()
for (i in 1:length(all.r)) {
  all.r.res <- c( all.r.res, resample(all.r[[i]], r1[[1]]) ) }
all.s <- stack(all.r.res)
mask <- all.s[[1]]/all.s[[1]] - 1

Create a data frame of raster pixel values, at sample locations only (for prediction/mapping later on).

cellIDs <- extract(mask, p[,-1], cellnumbers=T)[,1]
all.env <- data.frame(cell=cellIDs, toad.env.data)

Create a data frame of raster pixel values, for all pixels in the national parks (for prediction/mapping later on).

poly1 <- intersect(canorus, yose)
poly2 <- intersect(canorus, kica)
all.env.pred1 <- data.frame( do.call("rbind", extract(all.s, poly1, cellnumbers=T ) ) )
all.env.pred2 <- data.frame( extract(all.s, poly2, cellnumbers=T ) )
all.env.pred <- rbind(all.env.pred1, all.env.pred2)
longlat <- xyFromCell(mask, all.env.pred$cell)
all.env.pred <- data.frame(all.env.pred$cell, longlat, all.env.pred[,-1])
names(all.env.pred)[1:3] <- c("cell", "long", "lat")

Repeat all these steps for future (projected) climate data. All future variables are from the Community Climate System Model (CCSM), which was made by the National Center for Atmospheric Research (NCAR) and sponsored by NSF and the Department of Energy. The chosen Representative Concentration Pathway (RCP) is rcp85, meaning worst case scenario: emissions continue to rise throughout the 21st century. The BioClim variables are averaged over 2061-2080 (mean 2070), and the BCM variables are averaged over 2070-2099 (mean 2085).

The code is omitted here to save space, but let’s visualize some of the data.

**Change in BioClim variables for the study area. The first 4 out of 19 are plotted to save space.**

Change in BioClim variables for the study area. The first 4 out of 19 are plotted to save space.

**Change in BCM variables for the study area. The first 4 out of 19 are plotted to save space.**

Change in BCM variables for the study area. The first 4 out of 19 are plotted to save space.

Now let’s perform a PCA of the environmental data so far (elevation, BioClim, BCM). PCA is a type of unconstrained ordination, i.e. a technique for reducing the number of dimensions in the data. Instead of including 1 + 19 + 11 variables, we can include just a few that summarize most of the variance in these environmental predictors.

toad.env.pca <- prcomp(toad.env.data[,-c(1:4)], center=T, scale.=T)
toad.env.pca.data <- data.frame(toad.env.data[,1:4], toad.env.pca$x[,1:5])
plot(toad.env.pca)

autoplot(toad.env.pca, loadings=T, loadings.colour='blue', loadings.label=T, loadings.label.size=3)

summary(toad.env.pca)
## Importance of components:
##                          PC1    PC2     PC3     PC4     PC5     PC6
## Standard deviation     4.406 2.1546 1.42049 1.31327 1.09210 0.56149
## Proportion of Variance 0.647 0.1547 0.06726 0.05749 0.03976 0.01051
## Cumulative Proportion  0.647 0.8017 0.86896 0.92645 0.96621 0.97672
##                            PC7     PC8     PC9    PC10    PC11   PC12
## Standard deviation     0.48623 0.43929 0.31799 0.25543 0.17809 0.1453
## Proportion of Variance 0.00788 0.00643 0.00337 0.00217 0.00106 0.0007
## Cumulative Proportion  0.98460 0.99103 0.99440 0.99658 0.99763 0.9983
##                           PC13    PC14    PC15    PC16    PC17    PC18
## Standard deviation     0.11890 0.09122 0.08990 0.07537 0.06212 0.05393
## Proportion of Variance 0.00047 0.00028 0.00027 0.00019 0.00013 0.00010
## Cumulative Proportion  0.99881 0.99909 0.99936 0.99955 0.99967 0.99977
##                           PC19    PC20    PC21    PC22    PC23    PC24
## Standard deviation     0.04956 0.03622 0.02726 0.02456 0.02250 0.01852
## Proportion of Variance 0.00008 0.00004 0.00002 0.00002 0.00002 0.00001
## Cumulative Proportion  0.99985 0.99990 0.99992 0.99994 0.99996 0.99997
##                           PC25    PC26    PC27    PC28 PC29      PC30
## Standard deviation     0.01681 0.01528 0.01326 0.01093 0.01 5.035e-16
## Proportion of Variance 0.00001 0.00001 0.00001 0.00000 0.00 0.000e+00
## Cumulative Proportion  0.99998 0.99999 0.99999 1.00000 1.00 1.000e+00

Lastly, let’s use Moran’s Eigenvector Map (MEM) analysis to make spatial predictors of the data. As stated above, MEMs are based entirely on the lat/long information, they don’t use environmental or genetic data. They are important to include in landscape genomic models, however, because they can account for otherwise unmodeled variables, either from isolation-by-distance (IBD), phylogeographic signal in separate lineages, or unsampled environmental variables. Check out this tutorial for more on MEM analysis in R.

Briefly, the steps are as follows: 1. Build some kind of network (“spatial neighborhood”) between your points that describes how connected they are. In this case we’ll use a nearest neighbor matrix with k=10 nearest neighbors for each point, and inversely weight distance between points.
2. Calculate MEMs for the spatial neighborhood and its weight matrix. MEMs are a type of ordination (like PCA), except they make orthogonal variables of a geographic distance matrix. The resulting MEMs are ordered by their spatial autocorrelation (SA), from largest positive SA to largest negative SA. In this case we will only be interested in positive SA.
3. Determine which MEM variables are significant.
4. (Optional) choose which significant MEM variables to actually use, based on common sense.

The following command will launch a window that helps you choose a neighborhood scheme.

createlistw()

Next, create a matrix of spatial points, choose the nearest neighbor (with k=10 neighbors).

xy <- as.matrix(p[,-1])
nb <- chooseCN(coordinates(xy), type = 6, k = 10, plot.nb = T)

Now, create a weighting matrix, using inverse distance between points.

distnb <- nbdists(nb, xy)
fdist <- lapply(distnb, function(x) 1 - x/max(dist(xy)))
lw <- nb2listw(nb, style = 'W', glist = fdist, zero.policy = T)

Next, calculate the positive MEM variables.

mem.gab <- mem(lw, MEM.autocor="positive")
barplot(attr(mem.gab, "values"), main = "Eigenvalues of the spatial weighting matrix")

plot(mem.gab, SpORcoords = xy, nb = nb)

Test which ones are significant with alpha of 0.05 and 100 random permutations.

moranI <- moran.randtest(mem.gab, lw, 99)
signi <- which(moranI$pvalue < 0.05)
plot(mem.gab[,signi], SpORcoords = xy, nb = nb)

How many significant positive MEM variables do we have?

length(signi)
## [1] 16

For convenience, and because they appear to best represent phylogeographic structure, let’s only use the first 4 MEM variables.

plot(mem.gab[,signi[1:4]], SpORcoords = xy, nb = nb)

toad.env.pca.data <- data.frame(toad.env.pca.data, mem.gab[,1:4])
all.env <- data.frame(all.env, mem.gab[,1:4])
all.env.fut <- data.frame(all.env.fut, mem.gab[,1:4])

We now have all environmental variables! Let’s center all variables on 0, and scale them by their standard deviations.

toad.env.pca.data[,-1] <- scale(toad.env.pca.data[,-1], center=T, scale=T)

Save the environment variable names for later.

env.names <- names(toad.env.pca.data)[-1]

Sort the environmental data by population order from our bayenv2 input data (used below) to make sure they line up, and output these data.

toad.env.pca.data <- toad.env.pca.data[match(poporder, toad.env.pca.data$pop),]
toad.env.pca.data.r <- t(toad.env.pca.data[,-1])
write.table(toad.env.pca.data.r, "./Data/toad.env.pca.data.txt", quote=F, sep="\t", row.names=F, col.names=F)

3. Search for Outliers with PCAdapt

This approach is analogous to the traditional “outlier test” approach, which assumes an island model of migration between populations, estimates a null distribution of Fst values between said pops, and separately tests whether Fst for each SNP makes sense under this model. The problem with these models is that the island model is completely unrealistic in reality, because isolation-by-distance and hierarchical pop structure don’t meet its assumptions. Hence normal outlier approaches end up with lots of false positives and negatives. Simulations have shown this approach (using PCA to create a “model-free” null distribution of individual distances) is fairly robust to these problems. Check out the vignette on PCAdapt here for more on this method.

Read in the genetic data.

filename <- read.pcadapt(data.file.3, type="ped")
## Summary:
## 
##         - input file      ./Data/data_radsnp_allparks.ped
##         - output file     ./Data/data_radsnp_allparks.pcadapt
## 
##  - number of individuals detected:   644
##  - number of loci detected:      1672
## 
## File has been sucessfully converted.
poplist.names <- as.character(read.table(data.file.3, header=F)[,1])

In the first step, a PCA is performed on the centered and scaled genotype matrix. Specify number of PCs to retain. Ideally this should be the minimum number of PCs needed to describe “neutral” population genetic structure, but we don’t know what this is yet. For now we’ll pick a number (20) that should be overkill, and then refine with a scree plot and score plot (below).

x <- pcadapt(filename,K=20)
## Reading file ./Data/data_radsnp_allparks.pcadapt...
## Number of SNPs: 1672
## Number of individuals: 644
## Number of SNPs with minor allele frequency lower than 0.05 ignored: 725
## 81353 out of 1076768 missing data ignored.

Here’s a scree plot, showing the percentage of variance (eigenvalues) explained by each PC.

plot(x,option="screeplot")

Here’s a score plot, displaying the individual scores on PC1 and PC2.

plot(x,option="scores",i=1,j=2,pop=poplist.names)

Based on the scree plot, K=5 PCs seems like an ideal number of PCs to explain most population structure, so let’s look at a score plot of PC6 and PC7.

plot(x,option="scores",i=6,j=7,pop=poplist.names)

This looks pretty good.

Now, in the second step, each SNP is regressed on K=5 principal components. The multi-dimensional (Mahalanobis) distance of each SNP from the mean covariance is measured using the D statistic.

x <- pcadapt(filename,K=5)
## Reading file ./Data/data_radsnp_allparks.pcadapt...
## Number of SNPs: 1672
## Number of individuals: 644
## Number of SNPs with minor allele frequency lower than 0.05 ignored: 725
## 81353 out of 1076768 missing data ignored.

Let’s look at a summary of the output. Here’s a cheat sheet:
maf: minor allele frequencies
loadings: correlations between each genetic marker and PC
sing.values: K ordered squared root of the proportion of variance explained by each PC
scores: projections of the individuals onto each PC
zscores: z-scores from p-values
stat: Mahalanobis distances (by default)
gif: genomic inflation factor estimated from stat
chi2.stat: rescaled statistics stat/gif that follow a chi-squared distribution
pvalues: z-scores

summary(x)
##                 Length Class  Mode   
## maf             1672   -none- numeric
## loadings        8360   -none- numeric
## singular.values    5   -none- numeric
## scores          3220   -none- numeric
## zscores         8360   -none- numeric
## stat            1672   -none- numeric
## gif                1   -none- numeric
## chi2.stat       1672   -none- numeric
## pvalues         1672   -none- numeric

Here’s a Manhattan plot displaying −log10 of the p-values, obviously some (putative outliers) are must lower than most.

plot(x, option="manhattan")

Here’s Q-Q plot of p-values showing how well they conform to normality. The lowest 10% of p-values are on right side of the blue vertical line, and they don’t fit a normal distribution.

plot(x, option="qqplot", threshold=0.1)

A histogram of p-values shows that most form a uniform distribution, with an excess of low values (putative outliers).

hist(x$pvalues,xlab="p-values", main=NULL, breaks=50)

The presence of outliers is also visible when plotting a histogram of the test statistic D.

plot(x, option="stat.distribution")
## Warning: Removed 1 rows containing non-finite values (stat_bin).

SNPs with q-values less than α (expected FDR) are considered outliers.

qval <- qvalue(x$pvalues)$qvalues
alpha <- 0.05
outliers1.ints <- which(qval<alpha)
outliers1 <- as.character(loci[,1])[outliers1.ints]
outliers1
##  [1] "I188"   "I400"   "I3377"  "I16600" "I18217" "I19912" "I25126"
##  [8] "I25969" "I26600" "I27809" "I28851" "I30567" "I33100" "I33741"
## [15] "I34321" "I34572" "I36849" "I37492" "I37879" "I38790" "I45338"
## [22] "I46940" "I53635" "I57311" "I59867" "I60908" "I62250" "I62839"
## [29] "I64858" "I65037" "I68158" "I69681" "I71246"

Which PCs are the most correlated with the oulier SNPs? Here are the first 10.

snp_pc <- get.pc(x,outliers1.ints)
head(snp_pc, 10)
##       SNP PC
##  [1,] 4   3 
##  [2,] 7   2 
##  [3,] 83  5 
##  [4,] 237 4 
##  [5,] 262 3 
##  [6,] 299 5 
##  [7,] 435 5 
##  [8,] 451 5 
##  [9,] 469 5 
## [10,] 495 1

4. Search for Outliers with Bayenv2

The second class of outlier methods is those explicitly using correlation between the allele frequencies of SNPs and environmental variables putatively selecting them. These tests have their own pros (e.g. explicitly testing the reason for underlying positive selection), and cons (e.g. some programs have lots of false positives due to not controlling for population structure, or a poor choice of test statistic not accounting for multiple tests). The program bayenv2 is one of the better performing programs in this class, and simulations show it does a good job controlling for neutral covariance among populations.

The basic workflow is:
Step 1: Estimate the covariance matrix among all populations.
Step 2: Calculate the probability a SNP has correlation with each environmental variable after accounting for the “neutral” covariance, using Bayes Factors and correlation coeffiecients.

Running bayenv2 will take a long time (hours). So we will read in the output from a previous run here. This program looks for environmental correlations between each SNP and all environmental variables after accounting for the “neutral” covariance between populations.

Read in output from bayenv2.

be.output <- read.table("./Programs/bf_environ.toad.env.pca.data.txt", header=F)
be.output[,1] <- loci
names(be.output)[1] <- "loci"
names(be.output)[seq(2, length(env.names)*3, 3)] <- paste0(env.names, ".BF")
names(be.output)[seq(3, length(env.names)*3, 3)] <- paste0(env.names, ".Spear")
names(be.output)[seq(4, length(env.names)*3+1, 3)] <- paste0(env.names, ".Pear")

Look at output:

be.output[1:5,1:5]
##   loci   long.BF long.Spear long.Pear  lat.BF
## 1  I40   1.22640 -0.1310400 -0.099790 1.54980
## 2  I52   0.92891 -0.0076731  0.023296 1.24100
## 3 I172   1.04600  0.0622830  0.101710 0.82697
## 4 I188 137.31000 -0.2702900 -0.285960 1.68890
## 5 I196   1.17720 -0.1268400 -0.121810 0.83369

Let’s consider an outlier to have Bayes Factor > 20, and correlation coefficient (either pearson or spearman) > |0.2|. We will count this up for each locus by variable.

be.counts <- be.output
x1 <- be.counts[,seq(2, length(env.names)*3, 3)] > 20
x2 <- abs(be.counts[,seq(3, length(env.names)*3, 3)]) > 0.2
x3 <- abs(be.counts[,seq(4, length(env.names)*3+1, 3)]) > 0.2
locXenv <- x1 * (x2 + x3)
locXenv <- locXenv/locXenv
locXenv.sums <- rowSums(locXenv, na.rm=T)
locXenv.sums <- data.frame(loci=be.output$loci, numPos=locXenv.sums)
outliers2 <- as.character(locXenv.sums[which(locXenv.sums$numPos>0),1])
outliers2
##   [1] "I172"   "I188"   "I422"   "I1800"  "I2004"  "I2522"  "I2706" 
##   [8] "I2830"  "I3640"  "I4562"  "I5846"  "I5847"  "I6299"  "I12866"
##  [15] "I13790" "I15957" "I16655" "I18477" "I19373" "I20128" "I20516"
##  [22] "I20524" "I20827" "I21397" "I22381" "I22578" "I24410" "I24420"
##  [29] "I24649" "I24658" "I25173" "I26183" "I26471" "I26969" "I27983"
##  [36] "I28014" "I28646" "I28706" "I30442" "I30567" "I31722" "I31817"
##  [43] "I32283" "I33154" "I33189" "I33303" "I33664" "I33741" "I34572"
##  [50] "I34980" "I35188" "I37016" "I37428" "I37879" "I38163" "I38427"
##  [57] "I38900" "I39394" "I39835" "I39856" "I40203" "I40218" "I41319"
##  [64] "I41670" "I42315" "I42552" "I43270" "I44177" "I44452" "I44929"
##  [71] "I45680" "I45847" "I45866" "I46217" "I46715" "I47282" "I47423"
##  [78] "I47739" "I49070" "I49074" "I49136" "I49818" "I50736" "I53510"
##  [85] "I53710" "I53974" "I54027" "I55572" "I55853" "I57335" "I57983"
##  [92] "I58960" "I58989" "I59664" "I59867" "I60394" "I60414" "I60589"
##  [99] "I60867" "I61064" "I61192" "I61457" "I61563" "I61983" "I62839"
## [106] "I64275" "I64697" "I64851" "I64858" "I65331" "I66687" "I67560"
## [113] "I67875" "I68914" "I68960" "I69397" "I70013" "I70283" "I70378"
## [120] "I70849" "I90652"

Now, which putative outliers were found by BOTH PCAdapt and bayenv2?

outliers <- intersect(outliers1, outliers2)
outliers
## [1] "I188"   "I30567" "I33741" "I34572" "I37879" "I59867" "I62839" "I64858"

What were the environmental correlates?

which.env <- data.frame(loci=outliers, locXenv[be.output$loci %in% outliers,])
names(which.env) <- gsub(".BF", "", names(which.env))
which.env[is.na(which.env)] <- "."
which.env
##     loci long lat elevation PC1 PC2 PC3 PC4 PC5 MEM1 MEM2 MEM3 MEM4
## 1   I188    1   .         .   .   .   .   .   .    .    .    1    .
## 2 I30567    .   .         .   .   .   .   .   .    .    .    .    1
## 3 I33741    .   .         .   .   .   .   .   .    .    .    .    1
## 4 I34572    .   .         1   1   .   .   .   .    .    .    1    1
## 5 I37879    .   .         .   .   .   .   .   .    .    1    .    .
## 6 I59867    .   .         .   .   .   .   .   .    .    .    1    .
## 7 I62839    .   .         .   .   .   .   .   .    .    .    .    1
## 8 I64858    1   .         .   .   .   .   .   .    .    .    1    1

We will keep these as our outlier loci for the rest of this tutorial. We are assuming these represent, to some extent, positive natural selection of the environment on these toads, and these loci (or nearby loci) are good indicators of this process. If feasible, it would be good to follow up such assumptions with other steps to vet these loci (i.e. transcriptomics, common garden experiments, functional genomics, etc.)

5. Map Current and Future Local Adaptation

This section follows Fitzpatrick and Keller (2014). There are two methods that attack the problem of predicting and mapping local adaptation in slightly different ways: gradient forests (GF) and generalized dissimilarity modeling (GDM). GF will be used to predict allele frequencies of “reference” and “outlier” loci across space and time, whereas GDM will predict the genetic structure (Fst) among those populations of frequencies.

5.1. Gradient Forest Modeling

First, read in and format minor allele frequency data.

n.ind <- dim(read.table(data.file.4, skip=1)[,-c(1:2)])[1]/2
n.loc <- dim(read.table(data.file.4, skip=1)[,-c(1:2)])[2]
gi <- read.structure(data.file.4, n.ind=n.ind, n.loc=n.loc, col.lab=1, col.pop=2,
                     onerowperind=F,col.others=0, row.marknames=1, NA.char=0)
## 
##  Converting data from a STRUCTURE .stru file to a genind object...
gi.p <- genind2genpop(gi)
## 
##  Converting data from a genind to a genpop object... 
## 
## ...done.
gi.freq <- makefreq(gi.p)
## 
##  Finding allelic frequencies from a genpop object... 
## 
## ...done.
minor <- colMeans(gi.freq[,seq(1, ncol(gi.freq), 2)], na.rm=T) > 
         colMeans(gi.freq[,seq(2, ncol(gi.freq), 2)], na.rm=T)
keep <- sort(c(which(minor==T)*2-1, which(minor==F)*2))
gi.freq <- data.frame(gi.freq[,keep])
print(gi.freq[1:5,1:10], digits=3)
##      I40.T I52.A I172.A I188.T I196.T I373.G I400.T I422.T I452.G I506.G
## 101      1     1  1.000  1.000  0.833  0.000      1  1.000    NaN  0.300
## 1040     1     1  0.000  0.000  0.500  1.000      1  1.000      1  0.375
## 1070   NaN     1  0.000  0.000  1.000  1.000      1  1.000      1  0.500
## 1071     1     1  1.000  1.000  0.450  0.000      1  1.000      1  0.500
## 1097   NaN     1  0.727  0.455  0.650  0.409      1  0.455      1  0.650

Make an environmental dataset.

envGF <- all.env[match(row.names(gi.freq), all.env$pop),]

Build individual SNP datasets, with 300 randomly-selected reference SNPs, vs. all outlier SNPs.

SNPs_ref <- gi.freq[,sample(length(gi.freq), 300)]
SNPs_out <- gi.freq[,gsub("\\.[ATCG]", "", names(gi.freq)) %in% outliers]

Account for correlations, see ?gradientForest.

maxLevel <- log2(0.368*nrow(envGF)/2)

Fit gf models for reference and outlier SNPs.

gfRef <- gradientForest(cbind(envGF[,-c(1:2)], SNPs_ref), predictor.vars=colnames(envGF[,-c(1:2)]),
                        response.vars=colnames(SNPs_ref), ntree=500, trace=T)
gfOut <- gradientForest(cbind(envGF[,-c(1:2)], SNPs_out), predictor.vars=colnames(envGF[,-c(1:2)]),
                        response.vars=colnames(SNPs_out), ntree=500, trace=T)

Plot the output, see ?plot.gradientForest.

plot(gfRef, plot.type="O")

plot(gfOut, plot.type="O")

Predict/transform allele frequencies using current environment and gf models, see ?predict.gradientForest.

predRef.gf <- predict(gfRef, all.env.pred[,-1])
predOut.gf <- predict(gfOut, all.env.pred[,-1])

Now, let’s map PCAs of reference and outlier SNPs: PCA1 (red), PC2 (green), PC3 (blue).

refRGBmap <- pcaToRaster(predRef.gf, mask, all.env.pred$cell)
plotRGB(refRGBmap, axes=F)

outRGBmap <- pcaToRaster(predOut.gf, mask, all.env.pred$cell)
plotRGB(outRGBmap, axes=F)

Let’s see the Procrustes difference between maps (reference vs. outlier SNPs).

diffMap1 <- RGBdiffMap(predRef.gf, predOut.gf, rast=mask, mapCells=all.env.pred$cell)
plot(diffMap1[[2]], xaxt='n', yaxt='n', bty="n", legend=F, box=F, col=ramp)
addRasterLegend(diffMap1[[2]], location="topright", ramp=ramp, digits=numDig(diffMap1[[2]]), nTicks=4)

Now, predict/transform allele frequencies using FUTURE environment and gf models, see ?predict.gradientForest.

names(all.env.fut.pred) <- names(all.env.pred)
projRef.gf <- predict(gfRef, all.env.fut.pred[,-1])
projOut.gf <- predict(gfOut, all.env.fut.pred[,-1])

Let’s see the procrustes difference (first reference, second outliers) across time (future vs. current environment).

diffMap2 <- RGBdiffMap(projRef.gf, predRef.gf, rast=mask, mapCells=all.env.pred$cell)
plot(diffMap2[[2]], xaxt='n', yaxt='n', bty="n", legend=F, box=F, col=ramp)
addRasterLegend(diffMap2[[2]], location="topright", ramp=ramp, digits=numDig(diffMap2[[2]]), nTicks=4)

diffMap3 <- RGBdiffMap(projOut.gf, predOut.gf, rast=mask, mapCells=all.env.pred$cell)
plot(diffMap3[[2]], xaxt='n', yaxt='n', bty="n", legend=F, box=F, col=ramp)
addRasterLegend(diffMap3[[2]], location="topright", ramp=ramp, digits=numDig(diffMap3[[2]]), nTicks=4)

Lastly, what’s the temporal change in local adaptation (current reference - current outlier) - (future reference - future outlier)? This again uses procrustes analysis.

diffMap4 <- RGBdiffMap2(predRef.gf, predOut.gf, projRef.gf, projOut.gf, rast=mask, mapCells=all.env.pred$cell)
plot(diffMap4[[2]], xaxt='n', yaxt='n', bty="n", legend=F, box=F, col=ramp)
addRasterLegend(diffMap4[[2]], location="topright", ramp=ramp, digits=numDig(diffMap4[[2]]), nTicks=4)

5.2. Generalized Dissimilarity Modeling

Calculate pairwise Weir and Cockerham (1984) Fst for reference and outlier SNPs…

data.fst <- genind2df(gi)
names(data.fst)[-1] <- paste0("I", names(data.fst)[-1])
data.fst <- data.frame(INDIVIDUALS=row.names(gi@tab), POP_ID=data.fst$pop, data.fst[,-1])
ref.fst <- fst_WC84(data = data.fst, pairwise = T, verbose = T, digits = 6)
out.fst <- data.frame(data.fst[,c(1:2)], data.fst[,which(names(data.fst) %in% outliers)])
out.fst <- fst_WC84(data = data.fst, pairwise = T, verbose = T, digits = 6)

…or import ref.fst and out.fst, to save computational time for this tutorial.

ref.fst <- out.fst <- list()
ref.fst$pairwise.fst <- read.table("./Data/ref.fst.txt", header=T)
out.fst$pairwise.fst <- read.table("./Data/out.fst.txt", header=T)

Create a pairwise environmental data frame for GDM.

pairs <- data.frame(t(combn(envGF$pop, 2)))
names(pairs) <- c("POP1", "POP2")
df1 <- df2 <- envGF[,-c(1, (length(envGF)-3):length(envGF))]
names(df1)[-1] <- paste0("s1.", names(df1)[-1])
names(df2)[-1] <- paste0("s2.", names(df2)[-1])
envGDM <- merge(pairs, df1, by.x="POP2", by.y="pop")
envGDM <- merge(envGDM, df2, by.x="POP1", by.y="pop")
envGDM$POP1 <- factor(envGDM$POP1, ordered = T)
envGDM$POP2 <- factor(envGDM$POP2, ordered = T)
envGDM <- data.frame(envGDM[, c(1:4, (3+length(df1)-1), (4+length(df1)-1),
                 5:(3+length(df1)-2), (3+length(df1)+1):length(envGDM) )])
names(envGDM) <- gsub("long", "xCoord", names(envGDM))
names(envGDM) <- gsub("lat", "yCoord", names(envGDM))
envGDM <- data.frame(weights=rep(1, nrow(envGDM)), envGDM)

Build individual SNP datasets, scale Fst to 0-1. This facilitates model convergence.

SNPs_ref <- merge(ref.fst$pairwise.fst[,-4], envGDM, by=c("POP1","POP2"))
SNPs_ref$FST <- (SNPs_ref$FST - min(SNPs_ref$FST)) / (max(SNPs_ref$FST) - min(SNPs_ref$FST))
SNPs_out <- merge(out.fst$pairwise.fst[,-4], envGDM, by=c("POP1","POP2"))
SNPs_out$FST <- (SNPs_out$FST - min(SNPs_out$FST)) / (max(SNPs_out$FST) - min(SNPs_out$FST))

Build GDMs for reference SNPs. We will set geographic distance between populations as a predictor with GEO=T.

gdmRef <- gdm(SNPs_ref[,-c(1:2)], geo=T)
gdmRef$explained
## [1] 82.38797
plot(gdmRef, plot.layout=c(3,4))
refSplines <- isplineExtract(gdmRef) # extract spline data for custom plotting

Build GDMs for outlier SNPs. We will set geographic distance between populations as a predictor with GEO=T.

gdmOut <- gdm(SNPs_out[,-c(1:2)], geo=T)
gdmOut$explained
## [1] 34.20489
plot(gdmOut, plot.layout=c(3,4))
outSplines <- isplineExtract(gdmOut) # extract spline data for custom plotting

Predict/transform allele frequencies using current environment and gdm models, see ?gdm.transform.

all.env.pred[is.na(all.env.pred)] <- 0
predRef.gdm <- gdm.transform(gdmRef, all.env.pred[,-1])
predOut.gdm <- gdm.transform(gdmOut, all.env.pred[,-1])

Now, let’s map PCAs of reference and outlier SNPs: PCA1 (red), PC2 (green), PC3 (blue).

refRGBmap <- pcaToRaster(predRef.gdm, mask, all.env.pred$cell)
plotRGB(refRGBmap, axes=F)

outRGBmap <- pcaToRaster(predOut.gdm, mask, all.env.pred$cell)
plotRGB(outRGBmap, axes=F)

Let’s see the Procrustes difference between maps (reference vs. outlier SNPs).

diffMap5 <- RGBdiffMap(predRef.gdm, predOut.gdm, rast=mask, mapCells=all.env.pred$cell)
plot(diffMap5[[2]], xaxt='n', yaxt='n', bty="n", legend=F, box=F, col=ramp)
addRasterLegend(diffMap5[[2]], location="topright", ramp=ramp, digits=numDig(diffMap5[[2]]), nTicks=4)

Predict/transform allele frequencies using FUTURE environment and gdm models, see ?gdm.transform.

names(all.env.fut.pred) <- names(all.env.pred)
all.env.fut.pred[is.na(all.env.fut.pred)] <- 0
projRef.gdm <- gdm.transform(gdmRef, all.env.fut.pred[,-1])
projOut.gdm <- gdm.transform(gdmOut, all.env.fut.pred[,-1])

Let’s see the procrustes difference (first reference, second outliers) across time (future vs. current environment).

diffMap6 <- RGBdiffMap(projRef.gdm, predRef.gdm, rast=mask, mapCells=all.env.pred$cell)
plot(diffMap6[[2]], xaxt='n', yaxt='n', bty="n", legend=F, box=F, col=ramp)
addRasterLegend(diffMap6[[2]], location="topright", ramp=ramp, digits=numDig(diffMap6[[2]]), nTicks=4)

diffMap7 <- RGBdiffMap(projOut.gdm, predOut.gdm, rast=mask, mapCells=all.env.pred$cell)
plot(diffMap7[[2]], xaxt='n', yaxt='n', bty="n", legend=F, box=F, col=ramp)
addRasterLegend(diffMap7[[2]], location="topright", ramp=ramp, digits=numDig(diffMap7[[2]]), nTicks=4)

Lastly, what’s the temporal change in local adaptation (current reference - current outlier) - (future reference - future outlier)? This again uses procrustes analysis.

diffMap8 <- RGBdiffMap2(predRef.gdm, predOut.gdm, projRef.gdm, projOut.gdm, rast=mask, mapCells=all.env.pred$cell)
plot(diffMap8[[2]], xaxt='n', yaxt='n', bty="n", legend=F, box=F, col=ramp)
addRasterLegend(diffMap8[[2]], location="topright", ramp=ramp, digits=numDig(diffMap8[[2]]), nTicks=4)

6. How Generalizable are these Results?

What proportion of the species distribution is the study area?

p1 <- spTransform(poly1, crs("+proj=utm +zone=11 +datum=WGS84"))
p2 <- spTransform(poly2, crs("+proj=utm +zone=11 +datum=WGS84"))
c  <- spTransform(canorus, crs("+proj=utm +zone=11 +datum=WGS84"))
(gArea(p1)+gArea(p2))/gArea(c)
## [1] 0.284189

For a fairly endemic species such as the Yosemite Toad, it is possible to leverage spatially and genomically dense datasets to infer species-wide patterns. Although the study area is only about 1/4 to 1/3 of the distribution, this number is probably an underestimate, because populations north of Yosemite NP are sparse and/or extirpated. In reality, Yosemite and Kings Canyon NPs comprise about 40% of all populations. These models are a good generalization of the entire species. Population and landscape genomics are good heuristic tools that can help build more specific hypotheses about adaptive evolution within species.