Thursday, July 4, 2019

ENMTools R package now has a DOI


I finally got some time to put together the first real beta release of ENMTools, and got a DOI for it. If you're using the R version of ENMTools, please cite it as follows for now:

Warren, D.L., N. Matzke, M. Cardillo, J. Baumgartner, L. Beaumont, N. Huron, M. Simões, and R. Dinnage.  2019.  ENMTools (Software Package).  URL: https://github.com/danlwarren/ENMTools. doi:10.5281/zenodo.3268814

Friday, June 28, 2019

Now on bioRxiv: Evaluating species distribution models with discrimination accuracy is uninformative for many applications

I've just recently posted a new paper to bioRxiv entitled "Evaluating species distribution models with discrimination accuracy is uninformative for many applications", which I did in collaboration with Nick Matzke and Teresa Iglesias. Many of you have seen me present this work at conferences over the past few years, but this is the first time it's been available in a publicly accessible format. The conclusions of this paper are fairly problematic for SDM/ENM studies where the goal is to estimate the response of the species to a set of environmental gradients, rather than just to estimate its distribution, so I feel like it's worthwhile to do a preemptive breakdown and FAQ about what we are, and aren't, saying with this one, and to try to respond to some obvious questions it raises.

Main conclusion: Using a very large set of simulations, we find that using discrimination accuracy (AUC/TSS/kappa) on occurrence data is, under a broad range of conditions, not very useful for selecting models that accurately estimate the niche.  Here, "accurately estimating the niche" means getting the relative suitability of habitat right.  In other words, models that get your species' current spatial distribution right are not necessarily models that accurately estimate the relative suitability of habitat and vice versa. 

A very similar point was demonstrated by Phillips and Elith in their 2010 paper "Discrimination capacity in species distribution models depends on the representativeness of the environmental domain", but nine years later discrimination capacity remains the primary method for evaluating SDM/ENM models. Here we demonstrate the magnitude of this problem using a broader simulation approach that is intended to more closely approximate real SDM/ENM studies, and ask to what extent this phenomenon affects our ability to infer the niche. The answer is that under many realistic conditions, we simply cannot tell good niche estimates from bad ones this way.

How is this possible?  People have actually known that discrimination accuracy can be largely misinformative about calibration for quite some time, and this has even been brought up multiple times before in the SDM/ENM literature.  We're not really demonstrating a new problem here, so much as demonstrating the scope of a known problem.

As for how this is possible, consider the following plot:


Here we've got a bunch of presence and absence (or background/pseudoabsence) data, distributed along a one-dimensional environmental gradient.  The solid lines represent six different models we could use to explain or generalize this species' distribution along that gradient.  These are obviously very different models, right?  The issue is that discrimination accuracy essentially asks "how well do you assign higher scores to your presence data than you do to your non-presence data?", and in this example every one of these models assigns higher scores to all of the presence points than they do to the non-presence points.  

At first pass the message you might get from this is "okay, discrimination accuracy is saying some models are good that seem quite wacky, but it's also saying that some models are good that might be quite sensible". That's true, but the real issue is not just that it's assigning high discrimination accuracy to all of the models; it's actually assigning the same accuracy to all of these models. That means that if you had these data and this set of models in hand, you would be unable to choose between them using discrimination accuracy alone; their performance is identical.

This points to the root of the problem, as we see it: if your goal is to estimate a species' continuous response to the environment, discrimination accuracy carries very little signal that is useful for model selection, because there may be an arbitrarily large number of possible models that have the same (or similar) discrimination accuracy.
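The effect in the plot above is easy to reproduce numerically. Here's a minimal sketch in R; the gradient values and the three "models" are made up purely for illustration, and AUC is computed in its Mann-Whitney form (the probability that a presence outranks a non-presence):

```r
# Hypothetical data: presences at the high end of a 1-D environmental
# gradient, background points at the low end
pres.env <- seq(0.6, 0.9, length.out = 20)
bg.env   <- seq(0.1, 0.4, length.out = 20)

# AUC as the proportion of presence/background pairs where the
# presence gets the higher score (Mann-Whitney form, ignoring ties)
auc <- function(pres, bg) mean(outer(pres, bg, ">"))

# Three very different response models along the gradient
linear <- function(x) x                       # smooth increase
step   <- function(x) as.numeric(x > 0.5)     # hard threshold
spiky  <- function(x) x + 0.05 * sin(50 * x)  # wiggly response

# All three rank every presence above every background point,
# so all three get an AUC of 1
sapply(list(linear = linear, step = step, spiky = spiky),
       function(f) auc(f(pres.env), f(bg.env)))
```

Despite making wildly different statements about relative habitat suitability, the three models are indistinguishable by discrimination accuracy.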

Using methods that involve a large number of predictors and complex response functions substantially exacerbates this problem; we found that discrimination accuracy is really only useful for model selection when you have few (fewer than ~4) predictors and when the true marginal suitability function is a realizable case of the modeling algorithm. This means that discrimination accuracy isn't very effective at model selection when you're using (for example) some sort of step function to approximate a smooth one.

That's the statistical argument, but there's an ecological argument to be made here as well: we know that species' spatial distributions are shaped by a number of phenomena besides their environmental niches (biotic interactions, historical biogeographic processes, etc.).  This means that an accurate estimate of environmental habitat suitability will, in many cases, over- or under-estimate the actual distribution of the species.  Finding that good distribution estimates sometimes aren't great niche estimates (and vice versa) therefore shouldn't be surprising from a purely biological perspective.

Does this mean my models are bad? It doesn't necessarily mean that at all! What it means is that, if you have a set of candidate models with a range of performance under discrimination accuracy, discrimination accuracy is often not useful for choosing between them. Your models may all be good niche estimates (and they might be!); it just means you can't count on having picked the best niche estimate out of the batch.

It does have broader implications for the methodological literature, however.  Over the past few decades, we have developed a set of "best practices" for SDM/ENM studies.  These "best practices" have been favored largely based on studies where people quantify models' ability to predict withheld occurrence data via some measure of discrimination accuracy.  Our results imply that those "best practices" may only be best practices when the goal is to estimate species' distributions, and we may need to entirely reevaluate which of those methods work best for making good niche estimates. 

Is it possible that you just screwed up the simulations? It's always possible, but we did a lot of checking and double-checking to make sure our simulations do what they say on the tin.  Honestly the results are worrying enough that I would be quite happy to find out that this is all just due to me screwing up the simulation at some point that I haven't managed to find yet.  I will say, however, that I've recently been doing an entirely separate set of simulations for another purpose using entirely different methods of simulating species and occurrence data, and have found very similar results.

But you didn't use method X (where X = some method of choosing predictors, selecting study area, favored algorithm, etc.)!  This is a comment that seems to pop up every time this work is presented at a meeting or seminar.  It's true that we didn't explore every possible combination of methods in our analyses, and model construction was basically automated.  We did try to explore a pretty decent space of common methods though, including seven different algorithms, two different sizes of study region, spatially partitioned vs. random model validation, and varying levels of model and niche complexity.  Some of it seemed to make a bit of difference, but the only thing we could find that seemed to really improve things markedly was making models with few predictors, and making models with biologically realistic response functions (i.e., those that matched the true niche).  

As a result of exploring all of those combinations of approaches, this study involved the construction of many thousands of models, and we can't really do that in a hands-on fashion.  "It's a lot of work" isn't really a great argument, though, so let me clarify our reasoning on this topic.  There are a few things to note here:

  1. In order to explore the relationship between discrimination accuracy and our ability to estimate the niche (which we're calling "functional accuracy"), we actually need models with a range of performance on both of those axes, including some that are actually bad at one or the other, or both.
  2. The question we're addressing is not "can we make good niche estimates?", or "can we make good distribution estimates?", or even "how do we make the best niche and/or distribution estimates?"  The question we're asking has to do with whether or not evaluating models as distribution estimates helps us pick good niche estimates.  As such, using methods that have been shown to make better distribution estimates won't necessarily affect our results at all.  In fact, even finding methods that make better distribution and niche estimates won't affect our results unless those methods also produce a tighter correlation between discrimination and functional accuracy.  Given that we have no a priori reason to think that this is the case, along with the computation time involved, we decided the issue wasn't worth investing another year of computer time in.  If, however, someone can demonstrate that there is a set of methods that produces a strong correlation between discrimination and functional accuracy I will be absolutely thrilled.  One thing I think might be particularly interesting for future exploration is sample size.  We kept sample size constant here at 100 (75 training, 25 testing).  I suspect that higher sample sizes might expand the space of methods for which discrimination accuracy is useful in model selection, but that's a question for another day.
  3. Even assuming that some combination of methods does exist that produces a strong correlation between discrimination and functional accuracy, we think it's very useful to demonstrate that there's a broad swath of approaches for which those two things are not correlated.  Even if it's not broken everywhere, we feel there's great value to demonstrating that it's broken in a lot of places.
  4. Many of the methods people in the field might be inclined to suggest exploring are considered "best practices" due to previous methodological work in the field.  That's reasonable, but we'd highlight the fact that most of those methodological studies favored those methods specifically because they maximize discrimination accuracy.  Given that we're showing that discrimination accuracy is largely uninformative here for selecting niche estimates, their performance in those studies is not necessarily relevant here; see point 2.

Why post on bioRxiv? The paper is currently in review, and we're hopeful that it will be accepted before too long. However, I'm currently working on several projects that build on this work, and need a citation target for it. Others have asked for something they can cite too, so I decided to just go ahead and put it out there.

Where can I get the simulation code? I've got a GitHub repo for it here, but I'm still in the process of making it prettier and easier to understand. You're welcome to download it now, but if you wait a few days it will likely be far more compact and comprehensible.

How should I change my methods if I'm concerned about this?  As stated above, to me the big take-homes are:
  1. Use a small number of predictors in model construction, chosen because you think they are important in limiting the species' distribution.
  2. Use methods that produce biologically plausible response functions.
  3. Visualize marginal suitability functions, and think critically about them as biological estimates.

These suggestions basically mirror general best practices in any sort of modeling.  However, to my mind we as a field (myself very much included) have often allowed ourselves to violate these common-sense principles in pursuit of models that maximize discrimination accuracy.  In light of what Teresa, Nick, and I have found with these simulations, though, I think we should be much more careful about our models as niche estimates, and much less beholden to methods that make maps that are tightly fit to species' current spatial distributions.

One final note: In addition to de-emphasizing discrimination accuracy somewhat in empirical modeling studies, I hope that two things come out of this study.  First, I hope that future studies will examine the performance of alternative model selection metrics.  Second, I sincerely hope that this will help to emphasize how much SDM/ENM needs a robust simulation literature (see also this excellent review).  The question we examined in this study seems pretty fundamental to the field, and yet there really hadn't been much in the way of prior simulation work on it.  I think there are a ton of other low-hanging fruit out there that need to be reexamined in a simulation context; we've got a field where a lot of things have been justified by verbal arguments and appeals to common sense, and we really need to figure out whether that stuff actually works.


Addenda:

From Twitter, paraphrased: Why not evaluate other methods of measuring model fit (e.g., calibration)?  In this study, our goal was just to look at the most commonly used metrics of model fit in the literature, and so we stuck with discrimination accuracy on occurrence points.  People do measure model calibration and other aspects of model performance, but they're a tiny fraction of the published studies.  I do strongly suspect that calibration will be a much better tool for model selection when the goal is to estimate the niche, and we should definitely look at that soon.


Monday, February 18, 2019

Low memory options for rangebreak tests are now available on develop branch

There are now options available for writing replicate models to disk for the rangebreak tests, instead of storing them in the output object.  This works in exactly the same way as it does for the identity and background tests, as outlined here:

http://enmtools.blogspot.com/2019/02/low-memory-usage-options-for.html

Again it's currently only on the develop branch, but we'll move it to main before too terribly long.

RWTY 1.0.2 now on CRAN

The newest version of RWTY, version 1.0.2, is now on CRAN. This is a relatively minor release except for one significant bug fix: there was an issue causing one of the plotting functions to fail when it was called with a single MCMC chain (as opposed to multiple MCMC chains). That's fixed now, and all is well.

The other major change (for the worse, IMO) is that we had to remove the "Plot Comparisons" vignette to get the package in under CRAN's size restrictions.  That's a really useful vignette, so it sucks to have to remove it.  There are just too many images to make it fit the size requirements, though, and since the entire point of that vignette is comparing those images it doesn't make sense to remove them.  You can still get the vignette installed with RWTY by installing from GitHub or just check it out here: http://danwarren.net/plot-comparisons.html

Wednesday, February 13, 2019

Low memory usage options for identity.test and background.test

This is something I've been meaning to do for a while, but just finally got around to because it was screwing someone's analysis up. 

Originally, the ENMTools R package was designed to store all replicate models in the output object for the identity and background tests.  While that's fine for low resolution or small extent studies, it got to be a real problem for people working with high-resolution data over larger geographic extents.

To deal with this, I've created options for background.test and identity.test that allow you to save the replicate models to .Rda files instead of storing them in the output object.  When called using this option, the replicate.models entries in the identity.test and background.test objects contain paths to the saved model files instead of containing the models themselves.

By default these functions just store models in the working directory, but you can specify a directory to save them to instead if you prefer. 

To run these tests using the low memory options, just pass the argument low.memory = TRUE.  If you want to pass it a directory to save to, just add rep.dir = "PATH", where PATH is your directory name.

Be warned that replicate models WILL be overwritten if they exist.  It's a good idea to make a separate directory for the reps from each analysis.
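Putting the pieces together, a call might look something like this. The species objects and directory name here are placeholders, but the `low.memory` and `rep.dir` arguments are the new options described above:

```r
# Hypothetical example: 'monticola' and 'cyreni' are ENMTools species
# objects and 'env' is a raster stack of predictors.  Replicate models
# are written to ./id-test-reps/ instead of being kept in memory.
id.glm <- identity.test(species.1 = monticola, species.2 = cyreni,
                        env = env, type = "glm", nreps = 99,
                        low.memory = TRUE, rep.dir = "./id-test-reps/")
```

The `replicate.models` entries in `id.glm` will then contain file paths rather than model objects.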

This new functionality is currently only implemented on the "develop" branch, but we'll move it over to the main branch soon.

Monday, January 21, 2019

Fun fact: you can run a whole bunch of models at once using ENMTools quite easily

The ENMTools R package contains a function called "species.from.file". This takes a .csv file and creates a list of species objects, one for each value in the "species" column in your .csv file. So you can do:

species.list <- species.from.file("myspecies.csv")

and you'll get back a list with ENMTools species in it. If you wanted to run a bunch of models using those species with the same settings, you could then do:

my.models <- lapply(species.list, function(x) enmtools.gam(x, climate.layers, test.prop = 0.3))

where "climate.layers" is your raster stack. That would create a list of ENMTools GAM objects, setting aside 30% of the data from each for testing.
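The same pattern extends naturally if you want more than one algorithm per species; for example (assuming the other ENMTools model functions, such as enmtools.glm, which take the same basic arguments):

```r
# Fit a GLM and a GAM for every species in the list; the result is a
# nested list with one entry per species, each holding one model per
# algorithm
my.models <- lapply(species.list, function(x) {
  list(glm = enmtools.glm(x, climate.layers, test.prop = 0.3),
       gam = enmtools.gam(x, climate.layers, test.prop = 0.3))
})
```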

Friday, October 5, 2018

Why add correlations for suitability scores?

Hey y'all!  After a conversation with some colleagues, I realized that I sort of added Spearman rank correlation as a measure of overlap to ENMTools without really explaining why.  Here is a quick and dirty sketch of my thinking on that.

Previous measures of overlap between suitability rasters that were implemented in ENMTools were based on measures of similarity between probability distributions. That's fine as far as it goes, but my feeling is that those are more useful as measures of similarity between species' potential spatial distributions than as measures of similarity between the underlying models. Here's a quick example:

# Perfect positive correlation
sp1 = seq(0.1, 1.0, 0.001)
sp2 = seq(0.1, 1.0, 0.001)


You can only see one line there because the two species are the same.  I wrote a function called olaps to make these plots and to calculate three metrics of similarity that are implemented in ENMTools.
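The core of olaps boils down to something like the following sketch (plotting omitted). This assumes the standard ENMTools definitions: suitabilities are normalized to sum to one, then compared via Schoener's D, the Hellinger-distance-based I, and Spearman rank correlation:

```r
olaps <- function(x, y) {
  # Normalize the two suitability vectors so each sums to one
  px <- x / sum(x)
  py <- y / sum(y)
  list(D   = 1 - 0.5 * sum(abs(px - py)),             # Schoener's D
       I   = 1 - 0.5 * sum((sqrt(px) - sqrt(py))^2),  # I, from Hellinger distance
       cor = cor(x, y, method = "spearman"))          # Spearman rank correlation
}
```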

olaps(sp1, sp2)

$D
[1] 1

$I
[1] 1

$cor
[1] 1

Okay, that's all well and good - perfectly positively correlated responses get a 1 from all metrics.  Now what if they're perfectly negatively correlated?

# Perfect negative correlation
sp1 = seq(0.1, 1.0, 0.001)
sp2 = seq(1.0, 0.1, -0.001)
olaps(sp1, sp2)
$D
[1] 0.590455

$I
[1] 0.8727405

$cor
[1] -1

What's going on here?  Spearman rank correlation tells us that they are indeed negatively correlated, but D and I both have somewhat high values!  The reason is that the values of the two functions are fairly similar across a fairly broad range of environments, even though the functions themselves are literally as different as they could possibly be.  Thinking about what this means in terms of species occurrence is quite informative; if the threshold for suitability for a species to occur is low (e.g., .0005 in this cartoon example), they might co-occur across a fairly broad range of environments; both species would find env values from 250 to 750 suitable and might therefore overlap across about 2/3 of their respective ranges.  That's despite them having completely opposite responses to that environmental gradient, strange though that may seem.

So do you want to measure the potential for your species to occupy the same environments, or do you want to measure the similarity in their estimated responses to those environments?  That's entirely down to what question you're trying to answer!

Okay, one more reason I kinda like correlations:

# Random
sp1 = abs(rnorm(1000))
sp2 = abs(rnorm(1000))


Here I've created a situation where we've got complete chaos; the relationship of both species to the environment is completely random.  Now let's measure overlaps:

olaps(sp1, sp2)

$D
[1] 0.5641885

$I
[1] 0.829914

$cor
[1] -0.04745993

Again we've got fairly high overlaps between species using D and I, but Spearman rank correlation is really close to zero.  That's exactly what we'd expect if there's no signal at all.  Of course the fact that species distributions and the environment are both spatially autocorrelated means that we'll likely have higher-than-zero (at least in absolute value) correlations even if there is no causal relationship between the environment and species distributions, but at least it's nice to know that we do have a clear expected value when chaos reigns.



Code for this is here:

https://gist.github.com/danlwarren/5509c6a45d4e534c3e0c0ecf1702bbdd