Friday, August 9, 2019

Calibration metrics added to ENMTools R package

Given our recent paper on how poorly discrimination accuracy performs at selecting models that estimate ecological phenomena, the obvious next question is whether existing alternatives do a better job.  The answer is an emphatic YES: calibration seems to be a much more useful way of selecting models for most purposes.

This is obviously not the first time that calibration has been recommended for species distribution models; the Continuous Boyce Index (CBI) was developed for this purpose (Boyce et al. 2002, Hirzel et al. 2006), Phillips and Elith (2010) demonstrated the utility of POC plots for assessing and recalibrating models, and Jiménez-Valverde et al. (2013) argued convincingly that calibration is more useful than discrimination for model selection for many purposes.  However, discrimination accuracy still seems to be the primary method people use for evaluating species distribution models.

A forthcoming simulation study by myself, Marianna Simões, Russell Dinnage, Linda Beaumont, and John Baumgartner demonstrates exactly how stark the difference between discrimination and calibration is when it comes to model selection: calibration metrics generally perform fairly well, while discrimination accuracy is actively misleading about many aspects of model performance.  The differences we're seeing are stark enough that I would go so far as to recommend against using discrimination accuracy for model selection for most practical purposes.

We're writing this study up right now, and in the interests of moving things forward as quickly as possible we'll be submitting it to bioRxiv ASAP - likely within the next week or two.  As part of that study, though, I've implemented a number of calibration metrics for ENMTools models, including expected calibration error (ECE), maximum calibration error (MCE), and CBI.  We did not implement Hosmer-Lemeshow, largely because ECE is calculated in much the same way as HL and can be used in statistical tests similarly, but scales more naturally (a perfect model has an ECE of 0).  In our simulation study we found that ECE was by far the best-performing metric for model selection, so for now that's what I personally will be using.
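
To make the binning idea behind these metrics concrete, here's a small toy sketch (my own illustration, not the CalibratR implementation, which offers its own flavors of each): predictions are grouped into equal-width bins, ECE is the bin-size-weighted mean absolute gap between each bin's observed presence rate and its mean predicted suitability, and MCE is the largest such gap.

```python
import numpy as np

def calibration_errors(y_true, y_pred, n_bins=10):
    """Toy binned ECE and MCE for 0/1 outcomes and predictions in [0, 1].

    ECE: bin-size-weighted mean of |observed rate - mean prediction| per bin.
    MCE: the largest such per-bin gap.
    """
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    # assign each prediction to a bin (the top bin is right-inclusive)
    idx = np.clip(np.digitize(y_pred, edges[1:-1]), 0, n_bins - 1)
    n = len(y_pred)
    ece, mce = 0.0, 0.0
    for b in range(n_bins):
        mask = idx == b
        if mask.any():
            gap = abs(y_true[mask].mean() - y_pred[mask].mean())
            ece += (mask.sum() / n) * gap
            mce = max(mce, gap)
    return float(ece), float(mce)

# A perfectly calibrated model scores 0 on both metrics:
print(calibration_errors(np.array([0, 0, 1, 1]),
                         np.array([0.0, 0.0, 1.0, 1.0])))  # → (0.0, 0.0)
```

This is also why ECE scales more naturally than Hosmer-Lemeshow: it is bounded below by 0, which a perfect model attains.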

Fairly soon we'll integrate the calibration metrics into all of the model construction functions (e.g., enmtools.glm, enmtools.maxent, etc.).  For now, though, you can get calibration metrics and plots just by calling the enmtools.calibrate function.

That will give you two different flavors of ECE and MCE, as well as CBI and some diagnostic plots.  ECE and MCE calculations are done using the CalibratR package (Schwarz and Heider 2018), and CBI is calculated using functions from ecospat (Di Cola et al. 2017).

Boyce, M.S., Vernier, P.R., Nielsen, S.E., and Schmiegelow, F.K.A. 2002. Evaluating resource selection functions. Ecological Modelling 157:281-300.
Hirzel, A.H., Le Lay, G., Helfer, V., Randin, C., and Guisan, A. 2006. Evaluating the ability of habitat suitability models to predict species presences. Ecological Modelling 199:142-152.


  1. Hi Dan, this is all fantastic work and I cannot wait for the next paper to come out. I would like to ask, though: how do selection criteria fit into all this? Are AICc or BIC still relevant in today's context of model selection (for example, during the tuning of MaxEnt parameters)? Especially seeing how a couple of papers have recently raised issues with using AICc (Peterson et al., 2018; Velasco & González-Salazar, 2019).

    TL;DR: what are your thoughts on AICc?

    Thank you!

  2. Hi there! Sorry for the slow reply; I was on holiday.

    My feelings about information criterion-based model selection for Maxent are a bit complicated. We know it's wrong at some level because of the disconnect between the number of parameters and the effective degrees of freedom, and yet simulation studies show that it can work better than Maxent's default behavior at selecting optimal model complexity. Those concerns don't really apply to GLM/GAM models, though, so in general I think AICc is still pretty solid.

    It's worth noting that many of the other studies in this area (including the Velasco & González-Salazar one, though I can't recall about the Peterson et al. one) evaluate how well the selected models predict binary presence/absence, while Seifert and I were looking at how well selected models predicted continuous suitability scores. The fact that we get different answers about the performance of AICc at model selection may be entirely due to how we decide what a "good" model is in our simulations; as my recent paper with Matzke and Iglesias reinforces, there may be little connection between a model that makes good distribution predictions and one that actually predicts the relative suitability of habitat well.

    1. Thank you so much for your reply!
      This question may be answered in your paper to come, but I really want to hear your general thoughts as well. Given that your thoughts on AICc are mixed because of that disconnect in MaxEnt, would you simply drop AICc and stick to using ECE when building SDMs for ecological studies? Or would you take a sort of consider-all-metrics approach, like how some papers now try to balance AUC, AICc, and OR when selecting a model?

      There is just so much going on in the methods and I am unsure of what the best approach should be. Any general advice would be great! I just really wanted a peek into your outlook on SDMs.

    2. Actually in the new round of sims ECE isn't coming out so well either, at least on raw model outputs. We're tweaking our approach there, though, and it might get a lot better once we get that sorted. What we were thinking was going to be one paper is now looking like two at the very least.

      At the moment we're finding that continuous Boyce index is the most reliable calibration metric, and some of our new env space discrimination metrics are working way better than geographic space discrimination metrics as well. When your goal is to estimate the niche, that is. Even if nothing else comes out of this, the message "choose your metrics to suit your model's application" is a big take-home.

    3. Oh wow, that sounds tough and great at the same time. Thanks so much for your input! Really looking forward to those papers; I'm really inspired by your work. Good luck with ECE!

  3. Hi Dan, very interesting post!
    I have a couple questions.
    The getMCE function in the CalibratR package asks for two arguments: actual (a vector of observed class labels (0/1)) and predicted (a vector of uncalibrated predictions).
    How do I get the vector of observed class labels (0/1)?
    How do I get the vector of uncalibrated predictions?

    Best regards,

    1. I don't know how, but I hadn't gotten any notification about this reply! Sorry! You can actually just call enmtools.calibrate on any ENMTools model object, which takes care of all of that for you.

  4. This function is not yet available in the standalone ENMTools software, I guess. Will the ENMTools R package be able to use the outputs from Maxent runs in the future?

    1. There is no new development on the standalone ENMTools, and there won't be in the future; we're focusing only on the R package now.
