## Friday, October 5, 2018

### Why add correlations for suitability scores?

Hey y'all!  After a conversation with some colleagues, I realized that I sort of added Spearman rank correlation as a measure of overlap to ENMTools without really explaining why.  Here is a quick and dirty sketch of my thinking on that.

The measures of overlap between suitability rasters that were already implemented in ENMTools (D and I in the examples below) treat each raster as a probability distribution and measure how similar those distributions are. That's fine as far as it goes, but my feeling is that it's more useful as a measure of similarity between species' potential spatial distributions than as a measure of similarity between the underlying models. Here's a quick example:

# Perfect positive correlation
sp1 = seq(0.1, 1.0, 0.001)
sp2 = seq(0.1, 1.0, 0.001)

Plotting these two responses shows only one line, because the two species are identical. I wrote a function called olaps to make these plots and to calculate three metrics of similarity that are implemented in ENMTools.
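The real olaps is in the gist linked at the end of this post. For a rough idea of what the three numbers mean, here is a minimal sketch. It is not the gist's code: `overlap_sketch` is a made-up name, it skips the plotting, and it assumes the usual definitions (each vector rescaled to sum to 1, D as one minus half the summed absolute differences, I as one minus half the squared Hellinger distance, and Spearman's rho on the raw values).

```r
# Hypothetical helper, NOT the olaps() from the gist: a sketch of the three
# overlap metrics under the assumed definitions described above.
overlap_sketch <- function(x, y) {
  # Rescale each vector so it sums to 1, like a probability distribution
  px <- x / sum(x)
  py <- y / sum(y)

  list(
    # D: 1 minus half the summed absolute differences
    D = 1 - 0.5 * sum(abs(px - py)),
    # I: 1 minus half the squared Hellinger distance
    I = 1 - 0.5 * sum((sqrt(px) - sqrt(py))^2),
    # Spearman rank correlation of the suitability values
    cor = cor(x, y, method = "spearman")
  )
}

sp1 <- seq(0.1, 1.0, 0.001)
res <- overlap_sketch(sp1, sp1)
print(res)
```

Because Spearman's rho only looks at ranks, it doesn't care whether the values were rescaled first; D and I do.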

olaps(sp1, sp2)

$D
[1] 1

$I
[1] 1

$cor
[1] 1

Okay, that's all well and good - perfectly positively correlated responses get a 1 from all metrics.  Now what if they're perfectly negatively correlated?

# Perfect negative correlation
sp1 = seq(0.1, 1.0, 0.001)
sp2 = seq(1.0, 0.1, -0.001)
olaps(sp1, sp2)
$D
[1] 0.590455

$I
[1] 0.8727405

$cor
[1] -1

What's going on here? Spearman rank correlation tells us that they are indeed perfectly negatively correlated, but D and I both have somewhat high values! The reason is that the values of the two functions are fairly similar across a fairly broad range of environments, even though the functions themselves are literally as different as they could possibly be. Thinking about what this means in terms of species occurrence is quite informative. If the threshold for suitability for a species to occur is low (e.g., 0.0005 in this cartoon example), the two species might co-occur across a fairly broad range of environments: both would find env values from 250 to 750 suitable and might therefore overlap across about 2/3 of their respective ranges. That's despite them having completely opposite responses to that environmental gradient, strange though that may seem.
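The thresholding argument can be checked with a few lines of R. This is only a sketch, and it leans on one assumption: that the 0.0005 threshold applies to suitabilities rescaled to sum to 1 (the raw values here never drop below 0.1). The variable names are made up for the example.

```r
# Assumption: the 0.0005 threshold is applied to suitabilities that have
# been rescaled to sum to 1, not to the raw 0.1-1.0 values.
sp1 <- seq(0.1, 1.0, 0.001)
sp2 <- rev(sp1)  # same values in reverse order: perfect negative correlation

p1 <- sp1 / sum(sp1)
p2 <- sp2 / sum(sp2)

threshold <- 0.0005

# Points along the gradient where each species clears the threshold
suitable1 <- p1 >= threshold
suitable2 <- p2 >= threshold

# Fraction of the gradient that both species find suitable
shared <- mean(suitable1 & suitable2)
print(shared)
```

Under that assumption, well over half of the gradient is suitable for both species, even though their Spearman correlation is -1.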

So do you want to measure the potential for your species to occupy the same environments, or do you want to measure the similarity in their estimated responses to those environments?  That's entirely down to what question you're trying to answer!

Okay, one more reason I kinda like correlations:

# Random
sp1 = abs(rnorm(1000))
sp2 = abs(rnorm(1000))

Here I've created a situation where we've got complete chaos; the relationship of both species to the environment is completely random.  Now let's measure overlaps:

olaps(sp1, sp2)

$D
[1] 0.5641885

$I
[1] 0.829914

$cor
[1] -0.04745993

Again we've got fairly high overlaps between species using D and I, but Spearman rank correlation is really close to zero. That's exactly what we'd expect if there's no signal at all. Of course, species distributions and the environment are both spatially autocorrelated, so we'll likely see correlations that are further from zero than this even if there is no causal relationship between the environment and species distributions. Still, it's nice to know that we do have a clear expected value when chaos reigns.
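One draw is only one draw, so here is a small sketch that repeats the random example many times to see where the correlation tends to land. The seed and the number of replicates are arbitrary choices for illustration, and the draws are independent, so none of the spatial autocorrelation mentioned above is in here.

```r
set.seed(42)  # arbitrary seed, only so the sketch is repeatable

# Spearman correlation between two unrelated "species", repeated 500 times
rhos <- replicate(500, {
  sp1 <- abs(rnorm(1000))
  sp2 <- abs(rnorm(1000))
  cor(sp1, sp2, method = "spearman")
})

mean(rhos)  # centred near zero
sd(rhos)    # spread; about 1 / sqrt(n - 1) is expected for independent data
```

With 1000 points per species, that spread is on the order of 0.03, so a single value like the one above is well within what pure noise produces.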

Code for this is here:

https://gist.github.com/danlwarren/5509c6a45d4e534c3e0c0ecf1702bbdd