Friday, July 22, 2016

Background/similarity tests in the ENMTools R package

Okay, let's use our two species to run a background/similarity test.  This works a lot like the identity test (see the post preceding this one), but there's a new option called "test.type" that can be set to "asymmetric" or "symmetric".  Here's an asymmetric background test using Bioclim:




bg.bc.asym = background.test(species.1 = ahli, species.2 = allogus, env = env, type = "bc", nreps = 99, test.type = "asymmetric")
bg.bc.asym
## 
## 
##  
## 
## Asymmetric background test ahli vs. allogus background
## 
## background test p-values:
##        D        I rank.cor 
##      0.32      0.76      0.43 
## 
## 
## Replicates:
## 
## 
## |          |         D|         I|  rank.cor|
## |:---------|---------:|---------:|---------:|
## |empirical | 0.1328502| 0.3177390| 0.0706201|
## |rep 1     | 0.1430965| 0.3114858| 0.0824412|
## |rep 2     | 0.1284871| 0.2801639| 0.0156034|
## |rep 3     | 0.1599120| 0.3384525| 0.1136082|
## |rep 4     | 0.1431022| 0.3101197| 0.0766638|


What is "symmetric" vs. "asymmetric"?  Well, an asymmetric test means that we are comparing the empirical overlap to a null distribution generated by comparing one species' real occurrences to the background of another (species.1 vs. background of species.2).  In the Warren et al. 2008 paper we used this sort of asymmetric test, repeating it in each direction (species.1 vs. background of species.2 and species.2 vs. background of species.1).  While we had the idea that that might generate some interesting biological insight, I think it's generated just as much (if not more) confusion.  For this reason, the new R package also provides the option to do symmetric tests.  These tests compare the empirical overlap to the overlap expected when points are drawn randomly from the background of both species (species.1 background vs. species.2 background), keeping sample sizes for each species constant, of course.
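To make the distinction concrete, here's a rough sketch of where each replicate's points come from (illustrative pseudocode only — `model`, `overlap`, and `sample.background` are placeholders for the modeling, overlap, and background-sampling steps, not functions from the package):

```r
# One asymmetric replicate: species.1's real occurrences vs. a sample of
# points drawn from species.2's background
rep.asym = overlap(model(ahli$presence.points),
                   model(sample.background(allogus)))

# One symmetric replicate: random draws from both species' backgrounds,
# keeping each species' empirical sample size
rep.sym = overlap(model(sample.background(ahli)),
                  model(sample.background(allogus)))
```

Either way, repeating the replicate step nreps times builds up the null distribution that the empirical overlap is compared against.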

And now a symmetric background test using Domain:


bg.dm.sym = background.test(species.1 = ahli, species.2 = allogus, env = env, type = "dm", nreps = 99, test.type = "symmetric")

bg.dm.sym
## 
## 
##  
## 
## Symmetric background test ahli background vs. allogus background
## 
## background test p-values:
##        D        I rank.cor 
##      0.38      0.36      0.21 
## 
## 
## Replicates:
## 
## 
## |          |         D|         I|  rank.cor|
## |:---------|---------:|---------:|---------:|
## |empirical | 0.1328502| 0.3177390| 0.0706201|
## |rep 1     | 0.2382775| 0.4428653| 0.1774936|
## |rep 2     | 0.1518903| 0.3555431| 0.1002003|
## |rep 3     | 0.1250674| 0.3029139| 0.0717565|
## |rep 4     | 0.1165355| 0.2946842| 0.0841041|


Wednesday, July 20, 2016

Running an identity/equivalency test in the ENMTools R package

Okay, let's say we've got two enmtools.species objects: ahli and allogus.  How can we run an identity test?

Here's what we need:


  1. Our two species
  2. Our RasterStack of environmental layers
  3. The type of model we'd like ("glm", "bc", "dm", or "mx", for GLM, Bioclim, Domain, or Maxent)
  4. A formula (GLM only)
  5. The number of reps to perform

So here's how we'd run an identity test using GLM for our two species. 




id.glm = identity.test(species.1 = ahli, species.2 = allogus, env = env, type = "glm", f = presence ~ layer.1 + layer.2 + layer.3 + layer.4, nreps = 99)

Doing 99 reps takes a while, but when you're done, you get an "identity.test" object.  That contains all sorts of useful information.  A quick summary will show you some of it:

id.glm
## 
## 
##  
## 
## Identity test ahli vs. allogus
## 
## Identity test p-values:
##        D        I rank.cor 
##      0.01      0.01      0.01 
## 
## 
## Replicates:
## 
## 
## |          |         D|         I|   rank.cor|
## |:---------|---------:|---------:|----------:|
## |empirical | 0.2221752| 0.4661581| -0.4761597|
## |rep 1     | 0.8883545| 0.9899271|  0.8942366|
## |rep 2     | 0.8486324| 0.9828760|  0.9315827|
## |rep 3     | 0.8227838| 0.9742077|  0.8881490|
## |rep 4     | 0.7255044| 0.9469161|  0.5551645|
If you want to access the empirical or replicate models, those are stored in that object as well:

names(id.glm)
[1] "description"               "reps.overlap"              "p.values"                  "empirical.species.1.model" "empirical.species.2.model"
[6] "replicate.models"          "d.plot"                    "i.plot"                    "cor.plot"     
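For instance, based on those names you can pull out the empirical model for one species, or a single replicate model, directly:

```r
# Empirical model for species 1
id.glm$empirical.species.1.model

# First of the 99 replicate models
id.glm$replicate.models[[1]]
```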

As with building species models, identity.test works pretty much the same for Domain, Bioclim, and Maxent models with the exception that you don't need to supply a formula.
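For example, the same test with Bioclim just drops the formula:

```r
id.bc = identity.test(species.1 = ahli, species.2 = allogus, env = env,
                      type = "bc", nreps = 99)
```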

Monday, July 18, 2016

Using an enmtools.species object to build an ENM

Now that we've got our enmtools.species object and have assigned it presence and background data and a species name, we can use it to build models very simply!  For this functionality, ENMTools is basically just acting as a wrapper for dismo, using those functions to actually build models.  At present ENMTools only has interfaces for GLM, Maxent, Bioclim, and Domain, but that will change with time.

So let's use our enmtools.species object "ahli" to build a quick Bioclim model.  I've got a RasterStack object made up of four environmental layers.  It's named "env", and the layers are just "layer.1", etc.

Let's build a model!

ahli.bc = enmtools.bc(species = ahli, env = env)

For Bioclim, Domain, and Maxent, it's that easy!  ENMTools extracts the presence and background data from env using the data stored in the species object, builds a model, and returns some lovely formatted output.
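Under the hood that's conceptually similar to calling dismo yourself (a simplified sketch of the idea, not the package's actual internals):

```r
library(dismo)

# Pull environmental values at the presence points and fit a Bioclim envelope
pres.env = extract(env, ahli$presence.points)
manual.bc = bioclim(pres.env)

# Project the model back across the environmental layers
suitability = predict(env, manual.bc)
```

The convenience of the enmtools.species object is that all of this bookkeeping happens in one call.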

For GLM we need to supply a formula as well, but other than that it's identical.


ahli.glm = enmtools.glm(f = presence ~ layer.1 + layer.2 + layer.3 + layer.4, species = ahli, env = env)

Let's look at the output:

ahli.glm
## 
## 
## Formula:  presence ~ layer.1 + layer.2 + layer.3 + layer.4
## 
## 
## 
## Data table (top ten lines): 
## 
## | layer.1| layer.2| layer.3| layer.4| presence|
## |-------:|-------:|-------:|-------:|--------:|
## |    2765|    1235|    1174|     252|        1|
## |    2289|    1732|     957|     231|        1|
## |    2158|    1870|     983|     253|        1|
## |    2207|    1877|     967|     259|        1|
## |    2244|    1828|     945|     249|        1|
## |    2250|    1766|     919|     235|        1|
## |    2201|    1822|     978|     277|        1|
## |    2214|    1786|     986|     284|        1|
## |    2287|    1722|     992|     266|        1|
## |    2984|     965|    1311|     237|        1|
## 
## 
## Model:  
## Call:
## glm(formula = f, family = "binomial", data = analysis.df)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -0.67171  -0.20485  -0.14150  -0.09528   3.08762  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)  
## (Intercept) 46.178922  24.777923   1.864   0.0624 .
## layer.1     -0.013347   0.006276  -2.127   0.0334 *
## layer.2     -0.011985   0.006612  -1.813   0.0699 .
## layer.3      0.003485   0.006586   0.529   0.5967  
## layer.4     -0.009092   0.021248  -0.428   0.6687  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 164.58  on 1015  degrees of freedom
## Residual deviance: 150.15  on 1011  degrees of freedom
## AIC: 160.15
## 
## Number of Fisher Scoring iterations: 8
## 
## 
## 
## Suitability:  
## class       : RasterLayer 
## dimensions  : 418, 1535, 641630  (nrow, ncol, ncell)
## resolution  : 0.008333333, 0.008333333  (x, y)
## extent      : -86.90809, -74.11642, 19.80837, 23.2917  (xmin, xmax, ymin, ymax)
## coord. ref. : NA 
## data source : in memory
## names       : layer 
## values      : 6.419793e-08, 0.999983  (min, max)
You get pretty, formatted output and a nice plot as well.  Next up: identity/equivalency tests!



Saturday, July 16, 2016

The new R version of ENMTools is in the works! Here's how to build an enmtools.species object.

But for real this time.  I've started over entirely from scratch, and I'm using the new R package as a foundation for some novel analyses that I'm developing as part of my current research.  You can download it and view a fairly lengthy manual of what's currently implemented here:

https://github.com/danlwarren/ENMTools


For reasons that will become clear with time (when some of the downstream stuff gets finished), the way you interface with ENMTools is going to be a bit different from how you work with dismo or Biomod.  First off, you start by defining enmtools.species objects for each species (or population) that you want to compare.


Here I'll create one called ahli (based on data from Anolis ahli).



ahli = enmtools.species()

Now that doesn't have any data associated with it, so if we get a summary of it, we basically just hear back from R that we don't have any data.



ahli
## 
## 
## Range raster not defined.
## 
## Presence points not defined.
## 
## Background points not defined.
## 
## Species name not defined.

So let's add some data:





ahli$species.name = "ahli"
ahli$presence.points = read.csv("test/testdata/ahli.csv")[,3:4]
ahli$background.points = background.points.buffer(ahli$presence.points, 20000, 1000, env[[1]])

And then look at it again:

ahli

## 
## 
## Range raster: 
## class       : RasterLayer 
## dimensions  : 418, 1535, 641630  (nrow, ncol, ncell)
## resolution  : 0.008333333, 0.008333333  (x, y)
## extent      : -86.90809, -74.11642, 19.80837, 23.2917  (xmin, xmax, ymin, ymax)
## coord. ref. : NA 
## data source : in memory
## names       : layer.1 
## values      : 1, 1  (min, max)
## 
## 
## 
## Presence points (first ten only): 
## 
## | Longitude| Latitude|
## |---------:|--------:|
## |  -80.0106|  21.8744|
## |  -79.9086|  21.8095|
## |  -79.8065|  21.7631|
## |  -79.8251|  21.8095|
## |  -79.8807|  21.8374|
## |  -79.9550|  21.8374|
## |  -80.3446|  22.0136|
## |  -80.2983|  21.9951|
## |  -80.1776|  21.9023|
## |  -80.1591|  21.9673|
## 
## 
## Background points (first ten only): 
## 
## | Longitude| Latitude|
## |---------:|--------:|
## | -79.78726| 21.72920|
## | -79.82892| 21.73754|
## | -79.83726| 21.69587|
## | -80.01226| 22.01254|
## | -79.63726| 21.76254|
## | -79.92892| 21.78754|
## | -79.99559| 22.12920|
## | -79.81226| 21.87087|
## | -80.30392| 22.07920|
## | -79.97892| 21.85420|
## 
## 
## Species name:  ahli

Neat, huh?  Next up I'll show you how to build an ENM.


Friday, March 18, 2016

SiS Repost: Monte Carlo methods, nonparametric tests and you

Often when people first are introduced to ENMTools, a point of confusion arises when they come to the point of having to compare their empirical observations to a null distribution: it’s not something they’ve done so explicitly before, and they’re not quite sure how to do it.  In this post I’m going to try to explain in the simplest possible terms how hypothesis testing, and in particular nonparametric tests based on Monte Carlo methods, work.

Let’s say we’ve got some observation based on real data.  In our case, we’ll say it’s a measurement of niche overlap between ENMs built from real occurrence points for a pair of species (figure partially  adapted (okay, stolen) from a figure by Rich Glor).  We have ENMs for two species, and going grid cell by grid cell, we sum up the differences between those ENMs to calculate a summary statistic measuring overlap, in this case D.
[figure: ENMs for the two species and the grid-cell-by-grid-cell overlap calculation]
Due to some evolutionary or ecological question we’re trying to answer, we’d like to know whether this overlap is what we’d expect under some null hypothesis.  For the sake of example, we’ll talk about the “niche identity” test of Warren et al. 2008.  In this case, we are asking whether the occurrence points from two species are effectively drawn from the same distribution of environmental variables.  If that is the case, then whatever overlap we see between our real species should be statistically indistinguishable from the overlap we would see under that null hypothesis.  But how do we test that idea quantitatively?

In the case of good old parametric statistics, we would do that by comparing our empirical measurement to a parametric estimate of the overlap expected between two species (i.e., we would say "if the null hypothesis is true, we would expect an overlap of 0.5 with a standard deviation of .05", or something like that).  That would be fine if we could accurately make a parametric estimate of the expected distribution of overlaps under that null hypothesis, i.e., we need to be able to specify a mean and variance for expected overlap under that null hypothesis.  How do we do that?  Well, unfortunately, in our case we can’t.  For one thing we simply can’t state that null in a manner that makes it possible for us to put numbers on those expectations.  For another, standard parametric statistics mostly require the assumption that the distribution of expected measurements under the null hypothesis meets some criteria, the most frequent being that the distribution is normal.  In many cases we don’t know whether or not that’s true, but in the case of ENM overlaps we know it’s probably not true most of the time.  Overlap metrics are bound between 0 and 1, and if the null hypothesis generates expectations that are near one of those extremes, the distribution of expected overlaps is highly unlikely to be even approximately normal.  There can also be (and this is based on experience), multiple peaks in those null distributions, and a whole lot of skew and kurtosis as well.  So a specification of our null based on a normal distribution would be a poor description of our actual expectations under the null hypothesis, and as a result any statistical test based on parametric stats would be untrustworthy.  I have occasionally been asked whether it’s okay to do t-tests or other parametric tests on niche overlap statistics, and, for the reasons I’ve just listed, I feel that the answer has to be a resounding “no”.

So what’s the alternative?  Luckily, it’s actually quite easy.  It’s just a little less familiar to most people than parametric stats are, and requires us to think very precisely about the ideas we’re trying to test.  In our case, what we need to do is to find some way to estimate the distribution of overlaps expected between a pair of species using this landscape and these sample sizes if they were effectively drawn from the same distribution of environments.  What would that imply?  Well, if each of these sets of points were drawn from the same distribution, we should be able to generate overlap values similar to our empirical measurement by repeating that process.  So that’s exactly what we do!

We take all of the points for these two species and we throw them in a big pool.  Then we randomly pull out points for two species from that pool, keeping the sample sizes consistent with our empirical data.  Then we build ENMs for those sets of points and measure overlaps between them.  That gives us a single estimate of expected overlaps under the null hypothesis.  So now we’ve got our empirical estimate (red) and one realization of the null hypothesis (blue)
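In R, a single draw from that null distribution looks roughly like this (a sketch — `build.enm` and `calc.overlap` are stand-ins for whatever modeling method and overlap metric you're using):

```r
# Pool the real occurrences from both species
pooled = rbind(species.1.points, species.2.points)
n1 = nrow(species.1.points)

# Shuffle the pool and re-split it, keeping the empirical sample sizes
shuffled = pooled[sample(nrow(pooled)), ]
null.points.1 = shuffled[1:n1, ]
null.points.2 = shuffled[(n1 + 1):nrow(shuffled), ]

# Build models from the permuted points and measure their overlap
null.D = calc.overlap(build.enm(null.points.1, env),
                      build.enm(null.points.2, env))
```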
[figure: empirical overlap (red) and a single draw from the null distribution (blue)]
All right, so it looks like based on that one draw from the null distribution, our empirical overlap is a lot lower than you'd expect.  But how much confidence can we have in this conclusion based on one single draw from the null distribution?  Not very much.  Let's do it a bunch more times and make a histogram:
[figure: histogram of overlap values from 100 draws from the null distribution]
All right, now we see that, in 100 draws from that null distribution, we never once drew an overlap value that was as low as the actual value that we get from our empirical data.  This is pretty strong evidence that, whatever process generated our empirical data, it doesn't look much like the process that generated that null distribution, and based on this evidence we can statistically reject that null hypothesis.  But how do we put a number on that?  Easy!  All we need to do is figure out what the percentile in that distribution is that corresponds to our empirical measurement.  In this case our empirical value is lower than the lowest number in our null distribution.  That being the case, we can’t specify exactly what the probability of getting our empirical result is, only that it’s lower than the lowest value we obtained, so it’s p < (whatever that number is).  Since we did 100 iterations of that null hypothesis (and since our empirical result is also a data point), the resolution of our null distribution is 1/(100 + 1) ~= .01.  Given our resolution, that means p is between 0 and .01 or, as we normally phrase it, p < .01.  If we’d done 500 simulation runs and our empirical value was still lower than our lowest value, it would be p < 1/(500 + 1), or p < .002.  If we’d done 500 runs and found that our empirical value was between the lowest value and the second lowest value, we would know that .002 < p < .004, although typically we just report these things as p < .004.  Basically the placement of our empirical value in the distribution of expected values from our null hypothesis is an estimate of the probability of getting that value if that hypothesis were true.  This is exactly how hypothesis testing works in parametric statistics, the only difference being that in our case we generated the null distribution from simulations rather than specifying it mathematically.
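The bookkeeping for that p-value is simple once you have the replicates in hand (here `null.D` is a hypothetical vector of overlap values from the simulated replicates and `empirical.D` is the observed overlap):

```r
# One-tailed Monte Carlo p-value: where the empirical value falls in the
# null distribution, counting the empirical value itself as one observation
p.value = (sum(null.D <= empirical.D) + 1) / (length(null.D) + 1)
```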

So there you go!  We now have a nonparametric test of our hypothesis.  All we had to do was (1) figure out precisely what our null hypothesis was, (2) devise a way to generate the expected statistics if that hypothesis were true, (3) generate a bunch of replicate realizations of that null hypothesis to get an expected distribution under that null, and (4) compare our empirical observations to that distribution.  Although this approach is certainly less easy than simply plugging your data into Excel and doing a t-test or whatnot, there are many strengths to the Monte Carlo approach. For instance, we can use this approach to test pretty much any hypothesis that we can simulate – as long as we can produce summary statistics from a simulation that are comparable to our empirical data, we can test the probability of observing our empirical data under the set of assumptions that went into that simulated data.  It also means we don’t have to make assumptions about the distributions that we’re trying to test – by generating those distributions directly and comparing our empirical results to those distributions, we manage to step around many of the assumptions that can be problematic for parametric statistics.

The chief difficulty in applying this method is in steps 2 and 3 above – we have to be able to explicitly state our null hypothesis, and we have to be able to generate the distribution of expected measurements under that null.  Honestly, though, I think this is actually one of the greatest strengths of Monte Carlo methods: while this process may be more intensive than sticking our data into some plug-and-chug stats package, it requires us to think very carefully about what precisely our null hypothesis means, and what it means to reject it.  It requires more work, but more importantly it requires a more thorough understanding of our own data and hypotheses.

Wednesday, March 16, 2016

Symposium at Evolution 2016: Putting evolution into ecological niche modeling: Building the connection between phylogenies, paleobiology, and species distribution models

Just a heads up that Nick Matzke and I are organizing a symposium on integrating evolutionary thinking with niche and distribution modeling at this year's Evolution conference in Austin.  We've got some great speakers lined up, and are looking forward to a very productive and informative exchange of ideas.  Make sure to attend if you're at the meeting!

http://www.evolutionmeetings.org/special-talks.html

Thursday, January 21, 2016

Workshop: Model-based statistical inference in ecological and evolutionary biogeography. Barcelona, Spain, Nov. 28 - Dec. 2 2016.

Hey everybody!  Nick Matzke and I are working up a course to run down some of the standard methods in ecological and evolutionary biogeography, including some of the new methods he and I are working on for integrating evolution into species distribution models.  There are still some bits of it in flux, but at the bare minimum this is a chance to learn BioGeoBears and ENMTools (via the R package, which is developing rapidly) from their respective sources.  Our first iteration of the course will be this November/December in Barcelona.  Feel free to contact me if your institution would be interested in hosting a similar workshop sometime in 2017 or later, and please do enroll for Barcelona if you're interested!

Here's the official course announcement:
Nick Matzke and Dan Warren will be teaching a course entitled "Model-based Statistical Inference in Ecological and Evolutionary Biogeography" in Barcelona from November 28 to Dec 2 this year.
This course will cover the theory and practice of widely used methods in evolutionary and ecological biogeography, namely ecological niche modelling / species distribution modelling, and ancestral range estimation on phylogenies.
The course will cover both the practical challenges to using these techniques (the basics of R, obtaining and processing geographical occurrence data from GBIF, setting up and using the models), and the assumptions that various models and methods make.
R packages we will use include rgbif, dismo, ENMTools, and BioGeoBEARS.
Finally, this course will introduce several new approaches being developed by the instructors for linking ecological and evolutionary models.
For more details or to enroll, please see the Transmitting Science web site.

http://www.transmittingscience.org/courses/biog/statistical-biogeography/