Thursday, September 1, 2016

Age-overlap correlation tests and building an enmtools.clade object

NOTE: I wrote much of the core code for this at the PhyloDevelopeR workshop/hackathon in Nantucket last week, and I just wanted to express my appreciation for all I learned there.  I want to thank Liam Revell for hosting the workshop, and April Wright and Klaus Schliep for all their work organizing the workshop and teaching all of the classes.  It was a great experience, and I learned a lot. Thanks!

2nd note: Blogger keeps absolutely wrecking my R code.  Sorry about that.  Try copy/pasting the bits in grey boxes into a text editor, they should come out fine there.

The ENMTools R package now allows you to do age-overlap correlation tests.  This is a generalization of the age-range correlation tests of Barraclough and Vogler, Fitzpatrick and Turelli, etc.  Basically what these methods do is take the average (topologically corrected) overlap across each node in the tree, and perform a linear regression of overlap as a function of time.  The significance of the departure of that slope and intercept from zero is estimated via a Monte Carlo test, in which the tips of the tree are shuffled randomly.

Although this method was originally developed for used with overlap between species' geographic ranges, it has since been adapted for use with niche overlap metrics (Knouft et al. 2006, Warren et al. 2008) and metrics for similarity of point distributions (Cardillo and Warren 2016).  The current implementation in the ENMTools R package does all of the above, with one slick interface.

To run an AOC test, first you need to build an enmtools.clade object.  This involves building a set of enmtools.species objects, and then lumping them together into an enmtools.clade object with a phylogeny that includes all of those species.  It's important that the tip labels of the tree and the names of the species objects match, of course.  Here we'll build a clade for five Hispanolan anoles, using a tree called "hisp.anoles".

brev.clade=enmtools.clade(species = list(brevirostris, marron, caudalis, websteri, distichus), tree = hisp.anoles)

## An enmtools.clade object with 5 species
## Species names: 
##   brevirostris    caudalis    distichus   marron      websteri
## Tree: 
## Phylogenetic tree with 5 tips and 4 internal nodes.
## Tip labels:
## [1] "brevirostris" "caudalis"     "distichus"    "marron"      
## [5] "websteri"    
## Rooted; includes branch lengths.
## Data Summary: 
##              species.names  in.tree presence background range    
## brevirostris "brevirostris" TRUE    175      0          "present"
## caudalis     "caudalis"     TRUE    16       0          "present"
## distichus    "distichus"    TRUE    628      0          "present"
## marron       "marron"       TRUE    11       0          "present"
## websteri     "websteri"     TRUE    17       0          "present"

Easy, right?  Pay attention to the summary table at the end there, because it can help you to spot problems before they happen.  These AOC analyses can take a LONG time and if you have a run that aborts halfway through it really sucks.  Regardless of what kind of analysis you're doing you want to be sure that "in.tree" reads as "TRUE" for every species.  The "presence" and "background" columns tell you how many points each of your species has for the presence and background points, and the "range" column tells you whether each species has a range raster.

You don't need all of those data types for every analysis - which ones you do need will be determined by the type of analysis you end up doing.  For the methods that use ENMs (overlap.source = "bc", "dm", "glm", "gam", and "mx"), you will need presence points at a minimum and will have to supply a set of environmental layers for the function's "env" argument.  For overlap.source = "range", every species needs to have a range raster with numerical values within the species range and NA values outside of it.  For the point overlap method, all you need is presence points.

Okay, let's do an old fashioned age-range correlation!

range.aoc=enmtools.aoc(clade = brev.clade,  nreps = 50, overlap.source = "range")
## Age-Overlap Correlation test
## 50 replicates 
## p values:
## (Intercept)         age 
##  0.07843137  0.03921569

Cool, right?  This shows that the slope of the relationship between range overlap and node age is statistically significantly different from zero, meaning species ranges tend to overlap more as they become more distantly related.  It's a bit iffy to infer anything too meaningful from five species, but that suffices to demonstrate the functionality anyway.

You can do the same thing with point overlap metrics:

point.aoc = enmtools.aoc(clade = brev.clade,  nreps = 50, overlap.source = "points")
Or with ENM overlaps:
bc.aoc = enmtools.aoc(clade = brev.clade,  env = hisp.env, nreps = 50, overlap.source = "bc")
For ENM overlaps you can use any method of ENM construction that ENMTools knows ("bc", "dm", "gam", "glm", "mx"), and any metric of overlap that ENMTools knows ("D", "I", "cor", "env.D", "env.I", "env.cor").  You can also provide a formula for GLM and GAM models.  For instance:

glm.aoc = enmtools.aoc(clade = brev.clade,  env = hisp.env, nreps = 50, overlap.source = "glm", model = presence ~ snv_1 + snv_10, metric = "I")
This produces an analysis of ENM overlap as a function of time, with ENMs built using GLMs with presence modeled as a function of snv_1 + snv_10, using the I overlap metric from Warren et al. 2008.

Just to reiterate: these analyses can get VERY time-consuming, particularly for clades with a lot of species.

Addendum: In the future it'd be pretty interesting to hook this whole thing up to GLM instead of regular old lm, so we could get away from the assumption that the relationship between time and overlap is necessarily just a straight line.  Shouldn't be too hard to do, but I have so many other plans at the moment that it's not super high on my priorities list.

No comments:

Post a Comment