Sunday, July 31, 2016

Hey, what's up with those environmental overlaps in the ENMTools R package?

I'm so glad I asked!  The env.overlap metrics produced by the ENMTools R package are based on methods developed by John Baumgartner and me.  The purpose of these metrics is to address one of the key issues with the niche overlap metrics currently implemented in ENMTools and elsewhere: the difference between the geographic distribution of suitability and the distribution of suitability in environment space.

Existing methods in ENMTools and most other packages measure similarity between models via some metric that quantifies the similarity in predicted suitability of habitat in geographic space.  While this may be exactly the sort of thing you'd like to measure if you're wondering about the potential for species to occupy the same habitat in an existing landscape, it can be somewhat misleading if the availability of habitat types on the landscape is strongly biased.  For instance, what if two species have very little niche overlap in environment space, but that overlap happens to occur in a combination of environments that turns out to be very common in the current landscape?

To illustrate, let's take two species (red and blue) and look at their niches in environment space:

So they're pretty different, right? But what if the only available environments in the study region occur in that area of overlap?  E.g., what if the current environment space is represented by the green area here?

Well then the only environments we have within which we can measure similarity between species happen to be those environments that are suitable for both!  This means that our measure of overlap between models in geographic space could be arbitrarily disconnected from the actual similarity between those models in environment space.  Depending on the sort of question we're trying to ask, that could be quite misleading.
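
Here's a toy version of that scenario, as a rough sketch rather than anything ENMTools actually runs.  Everything in it is an assumption made for illustration: a single environmental variable, Gaussian response curves for the red and blue species, and Schoener's D as the overlap metric.  The "landscape" only contains environments in the narrow zone where both species do about equally well, while the "full range" covers the whole environment space.

```python
import math

def suitability(x, optimum, width=1.0):
    """Hypothetical Gaussian response curve for one environmental variable."""
    return math.exp(-0.5 * ((x - optimum) / width) ** 2)

def schoener_d(a, b):
    """Schoener's D between two suitability vectors, each rescaled to sum to 1."""
    sa, sb = sum(a), sum(b)
    return 1.0 - 0.5 * sum(abs(x / sa - y / sb) for x, y in zip(a, b))

def overlap_on(envs, opt_red=-2.0, opt_blue=2.0):
    """Overlap between red and blue measured only on the environments in envs."""
    red = [suitability(x, opt_red) for x in envs]
    blue = [suitability(x, opt_blue) for x in envs]
    return schoener_d(red, blue)

# The "green area": every cell on the landscape has an environment close to 0.
landscape = [-0.1 + 0.2 * i / 100 for i in range(101)]
# The whole environment space, sampled evenly from -6 to 6.
full_range = [-6.0 + 12.0 * i / 1000 for i in range(1001)]

geo_d = overlap_on(landscape)   # overlap as seen on the available landscape
env_d = overlap_on(full_range)  # overlap across environment space

print(f"overlap on the landscape:      {geo_d:.2f}")
print(f"overlap in environment space:  {env_d:.2f}")
```

The first number comes out high and the second comes out low, even though the two response curves never changed.  Only the set of environments we measured them on did.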

The method of measuring overlap developed by Broennimann et al. (2012) deals with this issue to some extent.  However, those methods only work in two dimensions, and only work by using a kernel density approach based on occurrence points in environment space.  My guess is that that's still way better than what the original ENMTools approach did for most purposes, but it's not very useful if (for instance) you want to ask how similar the environmental predictions of a GLM are to, say, an RF model.  You simply can't do it.  Or if you want to ask questions in a higher dimensional space, you're basically out of luck.

So what can we do?  Can we figure out a way to measure overlap between two arbitrary models in an n-dimensional space?  It turns out that this is not easy to do exactly, but you can get approximate measures to an arbitrary level of precision fairly easily!

Our approach leverages the fact that R already has great packages for doing Latin hypercube sampling.  This allows us to draw random, but largely independent, points from that n-dimensional environment space.  We can then use dismo's predict function to project our models to those points in environment space, and measure suitability differences between species.  Obviously throwing just a couple of points into a 19-dimensional space (for instance, if you're using all Bioclim variables) isn't going to get you very close to the truth, but of course if you keep throwing more points in there you will get closer and closer to the true average similarity between models across the space.
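
If you haven't met Latin hypercube sampling before, here's a minimal sketch of the idea.  This is not the R package ENMTools relies on, just a from-scratch illustration, and the two variables and their ranges are made up.  Each variable's range is cut into as many equal slices as there are points, and every slice gets used exactly once, so the points are random but can't all clump in one corner of the space.

```python
import random

def latin_hypercube(n_points, bounds, rng):
    """Draw n_points from the box given by bounds = [(min, max), ...].

    Each variable's range is split into n_points equal strata. Every stratum
    is used exactly once per variable, and the strata are shuffled
    independently for each variable.
    """
    columns = []
    for lo, hi in bounds:
        strata = list(range(n_points))
        rng.shuffle(strata)
        columns.append(
            [lo + (s + rng.random()) / n_points * (hi - lo) for s in strata]
        )
    # Transpose columns (one per variable) into points (one tuple per sample).
    return list(zip(*columns))

rng = random.Random(1)
# Hypothetical ranges for two variables, e.g. temperature and precipitation.
bounds = [(0.0, 30.0), (200.0, 2500.0)]
sample = latin_hypercube(10, bounds, rng)

for point in sample:
    print(point)
```

Once you have points like these, the remaining step is to ask each model for its predicted suitability at every point and compare the two sets of predictions.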

So that's what the method does: it starts by making a random Latin hypercube sample of 10,000 points from the space of all combinations of environments, with each variable bounded by its maximum and minimum in the current environment space.  Then it chucks another 10,000 points in there, and it asks how different the answer with 20,000 points is compared to the answer with 10,000.  Then it repeats for 30,000 vs. 20,000, and so on, until the change between successive measures falls below some threshold tolerance level.  This allows us to get arbitrarily close to the true overlap by specifying our tolerance level.  Lower tolerances take longer and longer to process, since they require more samples for the value to converge.  However, we've found that with tolerances set at around 0.001 we get very consistent results for a 4-dimensional comparison with an execution time of around two seconds.
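
The loop described above can be sketched roughly like this.  To be clear, this is not the ENMTools source: the function name `env_overlap`, the Gaussian "models", the use of Schoener's D as the overlap measure, and the `max_chunks` safety cap are all assumptions made so the example stands on its own.  In the real thing the two models would be fitted ENMs and the predictions would come from a predict call.

```python
import math
import random

def latin_hypercube(n_points, bounds, rng):
    """One Latin hypercube batch inside the box bounds = [(min, max), ...]."""
    columns = []
    for lo, hi in bounds:
        strata = list(range(n_points))
        rng.shuffle(strata)
        columns.append(
            [lo + (s + rng.random()) / n_points * (hi - lo) for s in strata]
        )
    return list(zip(*columns))

def schoener_d(a, b):
    """Schoener's D between two suitability vectors, each rescaled to sum to 1."""
    sa, sb = sum(a), sum(b)
    return 1.0 - 0.5 * sum(abs(x / sa - y / sb) for x, y in zip(a, b))

def env_overlap(model_a, model_b, bounds, tolerance=0.001,
                chunk=10000, max_chunks=100, seed=0):
    """Add batches of points until the overlap estimate stops changing.

    Returns (estimate, number of points used). max_chunks is a safety cap
    added for this sketch so the loop always terminates.
    """
    rng = random.Random(seed)
    pred_a, pred_b = [], []
    previous = None
    current = None
    for _ in range(max_chunks):
        points = latin_hypercube(chunk, bounds, rng)
        pred_a.extend(model_a(p) for p in points)
        pred_b.extend(model_b(p) for p in points)
        current = schoener_d(pred_a, pred_b)
        if previous is not None and abs(current - previous) < tolerance:
            break
        previous = current
    return current, len(pred_a)

def gaussian_model(optimum, width):
    """Hypothetical stand-in for a fitted model: suitability peaks at optimum."""
    def predict(point):
        return math.exp(
            -0.5 * sum(((x - o) / width) ** 2 for x, o in zip(point, optimum))
        )
    return predict

red = gaussian_model((0.3, 0.3), 0.2)
blue = gaussian_model((0.6, 0.6), 0.2)
bounds = [(0.0, 1.0), (0.0, 1.0)]  # made-up min and max for two variables

overlap, n_used = env_overlap(red, blue, bounds)
print(f"estimated overlap: {overlap:.3f} using {n_used} points")
```

Tightening `tolerance` makes the loop run for more batches before it stops, which is the speed versus precision trade-off mentioned above.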

Pretty cool, huh?  Now you can compare the environmental predictions of any two models that can be projected using ENMTools' predict() function in environment space, instead of just looking at their projections into a given geographic space!


  1. Interesting... would be cool to compare the results from the geographic niche overlap and env niche overlap, particularly in cases like your example above.

    1. Absolutely! I've done so with a few and they are surprisingly divergent in many cases.