Monday, September 12, 2016

Note: Parallelization not working with Maxent models

For the time being and for the foreseeable future, Maxent models aren't working with multiple cores.  This is due to an issue with the mclapply and rJava functions in R; rJava just straight-up does not work with mclapply, and as far as I can tell there's no way to make it do so.  As it stands, ENMTools just sets the number of cores to 1 for any of the tests when "type" is set to "mx".  If anyone knows, or discovers, a workaround for this please do let me know!

Thursday, September 8, 2016

Many core functions now parallelized

I've now got parallelized code running for the background and identity tests, as well as the linear and blob rangebreak tests.  The ribbon rangebreak test is going to take longer, because it contains some necessary failure detection code that needs to be wrapped differently than the other tests.

This code is not yet on the main branch on GitHub, it's on the branch named "apply".

As it stands, each of the functions by default uses all of the cores available in the system.  You can decrease that by just supplying a "cores = x" argument, where x is however many cores you want it to use.  If you're happy using all of the cores on your system, you can just call the functions exactly as before.

Obviously the speed differences here are going to depend on how many cores you have on your system.  I've got a 24-core machine I'm working on right now, and going from 1 core to 24 on my test data results in massive speed increases - identity and background tests for 20 reps drop from ~10 minutes to ~1 minute.  Pretty slick!

Anyway, give it a shot if you can and let me know if you run into any issues with it.  Thanks to Nick Huron for reminding me!

Small but useful change: ENMTools R package now auto-recognizes lat and lon columns

Just a quick little note to let you know that the ENMTools R package now automatically looks for columns named "x" and "y", or beginning with "lat" and "lon" as part of its pre-analysis check of the species objects.  It's a little thing, but it should prevent a lot of errors and confusion.

Thursday, September 1, 2016

Age-overlap correlation tests and building an enmtools.clade object


NOTE: I wrote much of the core code for this at the PhyloDevelopeR workshop/hackathon in Nantucket last week, and I just wanted to express my appreciation for all I learned there.  I want to thank Liam Revell for hosting the workshop, and April Wright and Klaus Schliep for all their work organizing the workshop and teaching all of the classes.  It was a great experience, and I learned a lot. Thanks!



2nd note: Blogger keeps absolutely wrecking my R code.  Sorry about that.  Try copy/pasting the bits in grey boxes into a text editor, they should come out fine there.



The ENMTools R package now allows you to do age-overlap correlation tests.  This is a generalization of the age-range correlation tests of Barraclough and Vogler, Fitzpatrick and Turelli, etc.  Basically what these methods do is take the average (topologically corrected) overlap across each node in the tree, and perform a linear regression of overlap as a function of time.  The significance of the departure of that slope and intercept from zero is estimated via a Monte Carlo test, in which the tips of the tree are shuffled randomly.

Although this method was originally developed for used with overlap between species' geographic ranges, it has since been adapted for use with niche overlap metrics (Knouft et al. 2006, Warren et al. 2008) and metrics for similarity of point distributions (Cardillo and Warren 2016).  The current implementation in the ENMTools R package does all of the above, with one slick interface.

To run an AOC test, first you need to build an enmtools.clade object.  This involves building a set of enmtools.species objects, and then lumping them together into an enmtools.clade object with a phylogeny that includes all of those species.  It's important that the tip labels of the tree and the names of the species objects match, of course.  Here we'll build a clade for five Hispanolan anoles, using a tree called "hisp.anoles".

brev.clade=enmtools.clade(species = list(brevirostris, marron, caudalis, websteri, distichus), tree = hisp.anoles)
check.clade(brev.clade)

## 
## 
## An enmtools.clade object with 5 species
## 
## Species names: 
##   brevirostris    caudalis    distichus   marron      websteri
## 
## Tree: 
## 
## Phylogenetic tree with 5 tips and 4 internal nodes.
## 
## Tip labels:
## [1] "brevirostris" "caudalis"     "distichus"    "marron"      
## [5] "websteri"    
## 
## Rooted; includes branch lengths.
## 
## 
## Data Summary: 
##              species.names  in.tree presence background range    
## brevirostris "brevirostris" TRUE    175      0          "present"
## caudalis     "caudalis"     TRUE    16       0          "present"
## distichus    "distichus"    TRUE    628      0          "present"
## marron       "marron"       TRUE    11       0          "present"
## websteri     "websteri"     TRUE    17       0          "present"

Easy, right?  Pay attention to the summary table at the end there, because it can help you to spot problems before they happen.  These AOC analyses can take a LONG time and if you have a run that aborts halfway through it really sucks.  Regardless of what kind of analysis you're doing you want to be sure that "in.tree" reads as "TRUE" for every species.  The "presence" and "background" columns tell you how many points each of your species has for the presence and background points, and the "range" column tells you whether each species has a range raster.

You don't need all of those data types for every analysis - which ones you do need will be determined by the type of analysis you end up doing.  For the methods that use ENMs (overlap.source = "bc", "dm", "glm", "gam", and "mx"), you will need presence points at a minimum and will have to supply a set of environmental layers for the function's "env" argument.  For overlap.source = "range", every species needs to have a range raster with numerical values within the species range and NA values outside of it.  For the point overlap method, all you need is presence points.

Okay, let's do an old fashioned age-range correlation!

range.aoc=enmtools.aoc(clade = brev.clade,  nreps = 50, overlap.source = "range")
summary(range.aoc)
## 
## 
## Age-Overlap Correlation test
## 
## 50 replicates 
## 
## p values:
## (Intercept)         age 
##  0.07843137  0.03921569


Cool, right?  This shows that the slope of the relationship between range overlap and node age is statistically significantly different from zero, meaning species ranges tend to overlap more as they become more distantly related.  It's a bit iffy to infer anything too meaningful from five species, but that suffices to demonstrate the functionality anyway.

You can do the same thing with point overlap metrics:

point.aoc = enmtools.aoc(clade = brev.clade,  nreps = 50, overlap.source = "points")
Or with ENM overlaps:
bc.aoc = enmtools.aoc(clade = brev.clade,  env = hisp.env, nreps = 50, overlap.source = "bc")
For ENM overlaps you can use any method of ENM construction that ENMTools knows ("bc", "dm", "gam", "glm", "mx"), and any metric of overlap that ENMTools knows ("D", "I", "cor", "env.D", "env.I", "env.cor").  You can also provide a formula for GLM and GAM models.  For instance:

glm.aoc = enmtools.aoc(clade = brev.clade,  env = hisp.env, nreps = 50, overlap.source = "glm", model = presence ~ snv_1 + snv_10, metric = "I")
This produces an analysis of ENM overlap as a function of time, with ENMs built using GLMs with presence modeled as a function of snv_1 + snv_10, using the I overlap metric from Warren et al. 2008.

Just to reiterate: these analyses can get VERY time-consuming, particularly for clades with a lot of species.



Addendum: In the future it'd be pretty interesting to hook this whole thing up to GLM instead of regular old lm, so we could get away from the assumption that the relationship between time and overlap is necessarily just a straight line.  Shouldn't be too hard to do, but I have so many other plans at the moment that it's not super high on my priorities list.

Monday, August 22, 2016

Quick FYI about lat/lon data in the ENMTools R package

In case any of you are hitting errors with the ENMTools R package, just be aware that at present it assumes that your data has longitude as the first column and latitude as the second column.  I've got some code to fix this, but I'm traveling at the moment and don't have the ability to merge it into the master branch quite yet.  I'll get to it soon (promise!) but for now just format your data in the way it expects to see it.  Thanks to Utku Perktas for reminding me!

Tuesday, August 16, 2016

Passing args to enmtools.maxent()

Thanks to comments from Matthew King and Nicholas Huron, I've been chasing down some bugs in how enmtools.maxent passes arguments to the "args" parameter of dismo's maxent function.  There were a couple of issues here: first, I'd just flat-out screwed something up.  That's fixed now, so go get the newest version before you do any maxenting.  Second, it just doesn't recognize "args" automatically, so you need to explicitly assign it when you call the function.  For instance, this doesn't work:

my.args =c("betamultiplier=0.5", "product=FALSE", "hinge=FALSE", "threshold=FALSE")

allogus.mx.args = enmtools.maxent(allogus, env, my.args)


But this does:

my.args =c("betamultiplier=0.5", "product=FALSE", "hinge=FALSE", "threshold=FALSE")

allogus.mx.args = enmtools.maxent(allogus, env, args = my.args)


Happy maxenting!

Sunday, August 14, 2016

Automatic report generation

One of the things I'm working on now on the develop branch of enmtools is code to automatically generate html reports on model structure and performance.  The goal here is to provide an accessible maxent-style output with as little hassle as possible.  The current structure is just a skeleton, but I think it's already pretty neat.  Here's a sample html report for a GAM, exactly as it comes out of enmtools:


Summary of ENMTools gam object for allogus

Summary of ENMTools gam object for allogus

Spatial prediction

plot of chunk plot-suitability






Model: presence ~ s(layer.1, k = 4) + s(layer.2, k = 4) + s(layer.3, , k = 4) + s(layer.4, k = 4)

plot of chunk response-plots

## 
## Family: binomial 
## Link function: logit 
## 
## Formula:
## presence ~ s(layer.1, k = 4) + s(layer.2, k = 4) + s(layer.3, 
##     k = 4) + s(layer.4, k = 4)
## 
## Parametric coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  -3.3732     0.1911  -17.65   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Approximate significance of smooth terms:
##              edf Ref.df Chi.sq  p-value    
## s(layer.1) 1.641  1.994  1.101  0.57526    
## s(layer.2) 1.000  1.001 26.379 2.81e-07 ***
## s(layer.3) 2.850  2.963 10.804  0.00856 ** 
## s(layer.4) 2.725  2.922  8.414  0.03309 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## R-sq.(adj) =  0.0661   Deviance explained = 12.7%
## UBRE = -0.63881  Scale est. = 1         n = 1052







Evaluation

Geographic space

plot of chunk eval-geo-train

## class          : ModelEvaluation 
## n presences    : 52 
## n absences     : 1000 
## AUC            : 0.7715385 
## cor            : 0.2319033 
## max TPR+TNR at : -2.695771



## 
## 
## Proportion of data wittheld for model testing:
## [1] 0.2

plot of chunk eval-geo-test

## class          : ModelEvaluation 
## n presences    : 13 
## n absences     : 1000 
## AUC            : 0.7910385 
## cor            : 0.1300592 
## max TPR+TNR at : -3.014335





Environment space

plot of chunk eval-env-train

## class          : ModelEvaluation 
## n presences    : 52 
## n absences     : 10000 
## AUC            : 0.5357038 
## cor            : -0.0329415 
## max TPR+TNR at : 0.01546464


## 
## 
## Proportion of data wittheld for model testing:
## [1] 0.2

plot of chunk eval-env-test

## class          : ModelEvaluation 
## n presences    : 13 
## n absences     : 10000 
## AUC            : 0.5380154 
## cor            : -0.01697503 
## max TPR+TNR at : 0.01534051





Model fit using gam.check

plot of chunk model-fit

## 
## Method: UBRE   Optimizer: outer newton
## full convergence after 9 iterations.
## Gradient range [-4.371328e-07,2.558592e-06]
## (score -0.6388078 & scale 1).
## Hessian positive definite, eigenvalue range [4.369759e-07,0.0004284836].
## Model rank =  13 / 13 
## 
## Basis dimension (k) checking results. Low p-value (k-index<1) may
## indicate that k is too low, especially if edf is close to k'.
## 
##               k'   edf k-index p-value
## s(layer.1) 3.000 1.641   0.932    0.47
## s(layer.2) 3.000 1.000   0.927    0.36
## s(layer.3) 3.000 2.850   0.862    0.02
## s(layer.4) 3.000 2.725   0.806    0.00





Notes

## [1] "No formula was provided, so a GAM formula was built automatically"





Citations

Warren, D.L. (2016) Package ‘enmtools’. Available online at: https://github.com/danlwarren/ENMTools

Hijmans, R.J, Phillips, S., Leathwick, J. and Elith, J. (2011), Package ‘dismo’. Available online at: http://cran.r-project.org/web/packages/dismo/index.html.