The script and data in this file draws on several sources, including:
Here we will set up your workspace, download and clean occurrence data.
First, if you are new to using R, we strongly suggest you visit this website and fo through that material before trying this.
You should be able to run all this code in EcoCommons’ coding cloud using either a jupyter notebook (R Kernal) or by importing this notebook into R studio as an R Markdown.
Let’s start by installing and loading the following packages (need to install and load the package pacman
first!):
::p_load(galah, rgbif, maps, tidyr, jpeg, raster, rgeos, sp, knitr, rmarkdown)
pacmansetwd(dirname(rstudioapi::getActiveDocumentContext()$path)) #sets your working directory to the location your scrip or markdown is saved in
<- "/Users/s2992269/Documents/Use_cases"
direct <- "/SDM_in_R"
folder ::opts_knit$set(root.dir=paste0(direct, folder)) knitr
Create a variety of folders in your working directory, if statement ignores this if the folder exists already.
<- paste0(getwd(),"/raw_data")
raw_data_folder if (!file.exists(raw_data_folder)) {dir.create(raw_data_folder)}
<- paste0(getwd(),"/data")
data_folder if (!file.exists(data_folder)) {dir.create(data_folder)}
<- paste0(getwd(),"/results")
results_folder if (!file.exists(results_folder)) {dir.create(results_folder)}
<- paste0(getwd(),"/results1")
results_folder if (!file.exists(results_folder)) {dir.create(results_folder)}
<- paste0(getwd(),"/results2")
results_folder if (!file.exists(results_folder)) {dir.create(results_folder)}
<- paste0(getwd(),"/results3")
results_folder if (!file.exists(results_folder)) {dir.create(results_folder)}
<- paste0(getwd(),"/results4")
results_folder if (!file.exists(results_folder)) {dir.create(results_folder)}
<- paste0(getwd(),"/results_brt")
results_folder if (!file.exists(results_folder)) {dir.create(results_folder)}
<- paste0(getwd(),"/results_glm")
results_folder if (!file.exists(results_folder)) {dir.create(results_folder)}
<- paste0(getwd(),"/results_gam")
results_folder if (!file.exists(results_folder)) {dir.create(results_folder)}
<- paste0(getwd(),"/scripts")
scripts_folder if (!file.exists(scripts_folder)) {dir.create(scripts_folder)}
<- paste0(getwd(),"/predictors")
raw_data_folder if (!file.exists(raw_data_folder)) {dir.create(raw_data_folder)}
<- paste0(getwd(),"/predictors_future")
raw_data_folder if (!file.exists(raw_data_folder)) {dir.create(raw_data_folder)}
Set galah_config by adding your email. galah_config(email="your-email@email.com") This email needs to be registered with ALA. You can register here.
galah_config(email="?@gmail.com") #again put your email in here
Select an ATLAS. These configuration options will only be saved for this session. Set preserve=TRUE
to preserve them for future sessions.
show_all_atlases() #we will be using the Australian ATLAS data
galah_config(atlas="Australia")
show_all_fields() #show all the fields or columns of data within in the ALA
Find the kinds of data in each field replace the field name of interest in quotes.
search_fields("australian states and territories")
search_fields("coordinateUncertaintyinMeters")
search_fields("occurrenceID")
search_fields("cl22") #field values
search_taxa("Amphibia")
search_taxa(genus="Limnodynastes")
search_taxa(species="Limnodynastes peroni")
Download some occurence data. The galah_call
function starts a query, then galah_identify
selects the taxa you are interested in, galah_filter
selects records for specified columns and atlas_occurrences retrieves the specified occurrence records. Look for records submitted by FrogID:
search_field_values("datasetName") #note you can see more rows, click lower right below
First check the number of records your query will return, there are 67,000+ records in ALA.
galah_call() %>%
galah_identify("Limnodynastes peroni") %>%
atlas_counts()
Then filter those records so only those records from “FrogID” are returned for this species and remove records with a coordinate uncertainty greater than 100m, there are 36,000+ records from the FrogID dataset, and with only coordinates < 100m precision.
galah_call() %>%
galah_identify("Limnodynastes peroni") %>%
galah_filter(datasetName == "FrogID") %>%
galah_filter(stateProvince == "Queensland") %>%
galah_filter(coordinateUncertaintyInMeters < 100) %>%
atlas_counts()
Select the occurrence records, galah_select
returns wanted columns. We could also filter by year, but FrogID data is all pretty recent. Often you will want to ensure you are not including really old records in your data (i.e., galah_filter(year >= 2020)
).
<- galah_call() %>%
LiPe galah_identify("Limnodynastes peroni") %>%
galah_filter(datasetName == "FrogID") %>%
galah_filter(coordinateUncertaintyInMeters < 100) %>%
galah_filter(stateProvince == "Queensland") %>%
atlas_occurrences()
Get familiar with data; this will return the column names or fields.
head(LiPe)
names(LiPe)
write.csv(LiPe, "raw_data/Limnodynastes peroni.csv")
summary(LiPe) #generate a summary of those data
Drop columns not needed in analyses. Note you will often want to filter on additional fields not shown here.
<- LiPe[,c("decimalLatitude",
LiPe "decimalLongitude",
"eventDate",
"scientificName")]
head(LiPe) #look at the top rows of the new dataset
This is the earliest date, 2017 is fairly recent, and we want to make sure our predictor variables are available for these years.
min(LiPe$eventDate) #this is the earliest date, 2017 is fairly recent
max(LiPe$eventDate) #this is the most recent date, 2020
write.csv(LiPe, "data/Limnodynastes peroni.csv") save
Double check the directory within the chunks is correct, noting the directory outside the code chunks might be different (getwd()
). Next, read in data and overwrite LiPe:
<- read.csv(paste0(getwd(),"/data/Limnodynastes peroni.csv")) LiPe
FrogID records are pretty clean, but often when you map data you will see odd locations
map("world", xlim=range(LiPe$decimalLongitude),
ylim=range(LiPe$decimalLatitude))
points(LiPe[ , c("decimalLongitude", "decimalLatitude")], pch=".")
When you look at these records mapped, you can see that most records are concentrated where people are near the bigger cities of eastern Australiathis kind of sampling bias is common in Australian occurrence data. It takes so much more effort to sample in remote locations.
There are a few things you can do to reduce the problem of bias, you can break up your study area into areas near cities, and more remote areas (we will not do that here).
You can reduce the number of records close to one another. This reduces spatial autocorrelation, and is a good step for these kinds of records before we show how to do spatial - thinning (reducing how close records are together) we need a base raster layer
In all the modelling we will show here, each environmental predictor needs to cover the same area (have the same extent), have grid cells that are the same size (same resolution), and use the same coordinate system to define a method for turning a spherical planet earth into a flat map (same coordinate system and map projection, i.e., Coordinate Reference System or CRS)
A base layer is one with a CRS, extent and resolution that all other layers will be turned into. Generally it is a good idea to use the largest possible extent, i.e. you often get better predictions when you include the entire range of your species, and when you use the finest (smallest grid cells) possible. Smaller grid cells only help prediction when the values vary from one grid cell to the next grid cell. I recommed choosing the variable that is most closely related to the ecology of your species and has the smallest resolution (grid cell size). Which variable do you think will predict best where your species are found?
This is Enhanced Vegetation Index (EVI) which captures greenness index while correcting ndvi issues, from 2004 downloaded from the EcoCommons site.
<- raster("raw_data/ndlc-2004-250m_trend-evi-mean.tif")
EVI plot(EVI)
EVI
<- SpatialPoints(coords=cbind(LiPe$decimalLongitude,
LiPe_pts $decimalLatitude),
LiPeCRS(as.character("+proj=longlat +datum=WGS84 +no_defs")) )
require(rgeos)
<- rgeos::gConvexHull(LiPe_pts) LiPe_convH
Check if your convex hull looks correct by plotting useful information on working with spatial data in R.
plot(EVI)
lines(LiPe_convH)
Increase the size of your extent (study area to beyone the extent of your point data) our CRS is in decimal degrees so 1 degree ~ 111km larger.
<- rgeos::gBuffer(LiPe_convH, width=1) LiPe_extent
Check if your convex hull or extent is now larger
plot(EVI)
lines(LiPe_extent)
<- raster::crop(EVI,LiPe_extent)
EVI_LiPe plot(EVI_LiPe)
EVI_LiPe
<- raster::mask(EVI_LiPe,LiPe_extent)
EVI_LiPe_mask plot(EVI_LiPe_mask)
Create your reference raster, all other rasters will be set to this resolution, extent and CRS, dividing raster by itself sets all values to 1.
<- EVI_LiPe_mask/EVI_LiPe_mask
base crs(base) <- crs(EVI)
plot(base)
writeRaster(base,"data/base_LiPe.asc", overwrite=TRUE)
Run these lines if you are coming back to the script to load your base layer:
require(raster)
<- raster("data/base_LiPe.asc")
base plot(base) #check that it looks correct
Spatially thin occurrence records so that only one record is selected randomly from within a Grid cell that is 4 times larger than “base”. Aggregate reduce resolution (make grid cells larger) (factor=4) notice grid cells are now 4 times larger.
<- aggregate(base, fact=4)
large_base res(base)
res(large_base)
<- read.csv("data/Limnodynastes peroni.csv")
LiPe <- sp::SpatialPoints(coords=cbind(LiPe$decimalLongitude,
LiPe_pts $decimalLatitude),
LiPeCRS(as.character("+proj=longlat +datum=WGS84 +no_defs")) )
<- raster::extract(large_base,
cell_no
LiPe_pts,cellnumbers=TRUE)
head(cell_no)
<- cbind(LiPe,cell_no)
LiPe_cells head(LiPe_cells)
require(dplyr)
<- LiPe_cells %>%
LiPe_thinned group_by(cells) %>%
slice_sample(n=1)
length(LiPe_thinned$cells)
Plot to look at results and write results to file:
<- SpatialPoints(coords=cbind(LiPe_thinned$decimalLongitude,
LiPe_pts2 $decimalLatitude),
LiPe_thinnedCRS(as.character("+proj=longlat +datum=WGS84 +no_defs")))
plot(base)
points(LiPe_pts2,pch=20,cex=0.2)
write.csv(LiPe_thinned,"data/LiPe_thinned.csv")
Download all FrogID survey records to capture survey effort, spatial sampling bias, and to zero-fill.
galah_call() %>%
galah_identify("Amphibia") %>%
galah_filter(datasetName == "FrogID") %>%
galah_filter(coordinateUncertaintyInMeters < 100) %>%
galah_filter(stateProvince == "Queensland") %>%
galah_select("datasetName") %>%
atlas_counts()
<- galah_call() %>%
frogs galah_identify("Amphibia") %>%
galah_filter(datasetName == "FrogID") %>%
galah_filter(coordinateUncertaintyInMeters < 100) %>%
galah_filter(stateProvince == "Queensland") %>%
galah_select(datasetName,occurrenceID) %>%
atlas_occurrences()
head(frogs)
length(frogs$occurrenceID)
$unique_visit <- paste0(frogs$decimalLatitude,
frogs$decimalLongitude,
frogs$eventDate)
frogslength(unique(frogs$unique_visit))
length(unique(frogs$eventDate))
$visitID <- as.numeric(as.factor(frogs$unique_visit))
frogslength(unique(frogs$visitID))
head(frogs$visitID, 100)
head(frogs)
names(frogs)
<- frogs[,c(1,3:6,11)]
frogs2 head(frogs2)
getwd()
write.csv(frogs2,"raw_data/FrogID_all_ALA_Mar2022.csv")
require(dplyr)
names(frogs2)
Richness on each visit:
<- frogs2 %>%
frogs3 group_by(decimalLatitude,
%>%
decimalLongitude, visitID) summarise(no_spp=length(unique(scientificName)))
head(frogs3)
length(frogs3$decimalLatitude)
summary(frogs3$no_spp)
Calculate the number of visits to the same lat long location.
<- frogs3 %>%
frogs4 group_by(decimalLatitude, decimalLongitude) %>%
summarise(no_visits=length(visitID))
length(frogs4$decimalLatitude)
summary(frogs4$no_visits)
Note there were 448 visits to one lat long location, but most locations only had one visit.
require(sp)
<- SpatialPoints(coords=cbind(frogs4$decimalLongitude,
visits_pts $decimalLatitude),
frogs4CRS(as.character("+proj=longlat +datum=WGS84 +no_defs")))
Whoops we need to include only those points within our study area:
plot(base)
points(visits_pts,pch=20,cex=0.2)
This creates a raster that totals the total number of visits at each lat/long location
<- rasterize(visits_pts,
b1
base, $no_visits,
frogs4fun=sum,
background=0)
b1
This is one way to generate a bias file, simply make a layer where the value in each cell=the number of surveys done in that cell. Maxent will not work with values of zero in the bias file, so below we add 1 to each cell value in our study area (all values in base=1).
Here you can see that there are too few points to show up on a map. If unable to add the bias layer directly into Maxent (args=c("biaslayer=bias")
) - this method does not work for me and unfortunately means that when using R it is not using the FACTORBIASOUT
argument, but a work around is to account for bias within Maxent by using your bias grid to determine the probability of sampling that grid cell to generate your background.
Here most of the values in the grid are the same value of 1 which in our case means no sampling was done there. So in this example we have a number of sites with an increased probability of being selected as a background point in Maxent, but most of the study area has the same low probability of being selected as background (some people might choose to use this).
<- b1+base numb_visits_bias
Below we highlight an alternative method to generate a bias grid. If we assume areas close to areas where surveys have been conducted are more likely to have been surveyed, so there is a general spatial sampling bias, we may want to identify those spatial locations where data was more likely to have come from.
You can do this kind of thing with a focal function (see Step 2), but furthe below we generate a smoothed spatial surface where the probability of an area being sampled is a function of distance and sampling intensity. Using a Kernel density function is one way to do this.
<- bbox(b1)
bb <- raster::extract(b1, visits_pts,
visit_locations cellnumbers=TRUE)
These are the points that only occur in the study area
<- as.data.frame(na.omit(visit_locations))
visit_locations2 <- unique(visit_locations2$cells)
cellID <- xyFromCell(b1,cell=cellID)
xy_visits
<- c(bb[1,1],bb[1,2])
lg <- c(bb[2,1],bb[2,2])
lt <- cbind(lg,lt)
coordss <- rbind(xy_visits,coordss) v_xy2
For some of these functions kde2d, and xyFromCell I had to increase my local R-environ memory size. Open terminal > cd ~ touch .Renviron, open .Renviron > save the following as the first line of .Renviron: R_MAX_VSIZE=100Gb.
require(MASS)
<- kde2d(v_xy2[,1], v_xy2[,2], n=c(nrow(b1), ncol(b1)))
dens <- raster(dens)
b5 plot(b5)
<- projectRaster(b5,base)
b6 <- mask(b6,base)
b7 plot(b7)
writeRaster(b7,"data/Bias_LiPe_kd.asc",overwrite=TRUE)
Now lets clean up our global environment, and get rid some of these big files:
rm(b5)
rm(b6)
rm(EVI)
rm(EVI_LiPe)
rm(EVI_LiPe_mask)
rm(frogs2)
rm(frogs3)
rm(visit_locations)
To identify locations that were surveyed but where LiPe was not detected.
The FrogID project attempts to identify all frog species that were heard during a period of sound recording. While there is a possibility that frogs were present but not detected it is correct to assume that if the target species was not recorded on a certain date and time at a specific lat/long the there was a zero detection.
You can then make some assumptions around the number of surveys needed to be sure there were no frogs detected, while understanding that the spatial resolution of the modelling you are doing may include many frog habitats some of which may not have been surveyed. Still this method of generating pseudo-absences is at least putting zeros in areas where surveys were done, but the species was not detected.
If you relax the assumptions further, then species for which counts are usually inclusive of all the species detected (like bird lists), those cells where 10 (the number of surveys will vary by species dataset etc) surveys have been done but which did not detect the target species can be assumed to have zero of that species.
The raster layer we created and named b1 has the number of surveys at all the locations where frogs were surveyed. First, we are going to check how many cells have a value of 3 or higher. So is there enough data to produce pseudo-absence zeros if we assume that when three visits were done in a grid cell we should have identified our target species (here striped marsh frog).
In a perfect world we may have needed 10 surveys at each of the habitats within each of the grid cells in order to be certain that the frog is truly absent from this location. Here we have some evidence that the frog was not present, and we have the B1 raster which captures survey effort which is a useful covariate in occupancy modelling (something we won’t go into here).
Reclassify raster with number of surveys to 1 if more than 2 surveys were done, and zero otherwise
<- c(0, 2.9, 0, 2.9, 3247, 1)
m <- matrix(m, ncol= 3, byrow=TRUE)
reclass <- reclassify(b1, reclass) rc
Since all values are 1 or zero and ones are where there were three or more visits, cellStats gives us the number of grid cells where FrogID surveys were conducted (9196) in our case.
<- mask(rc,base)
visits_3or_more freq(visits_3or_more)
Create a vector of 1’s of the same length as the LiPe_pres
, then create a raster with a value of 1 for each gridcell where a LiPe
was recorded.
<- rep(1,length(LiPe$decimalLatitude))
LiPe_pres <- rasterize(LiPe_pts,base,
LiPe_pres_raster
LiPe_pres, fun=min,
background=0)
<- mask(LiPe_pres_raster,base) LiPe_pres_raster2
How many grid cells of the pre-thinned data had LIPE
recorded in them?
freq(LiPe_pres_raster2)
Confirm that your two raters are the same extent, CRS, resolution etc:
compareRaster(LiPe_pres_raster2,
visits_3or_more,extent=TRUE,
rowcol=TRUE,
crs=TRUE,
res=TRUE,
orig=TRUE,
rotation=TRUE,
values=FALSE)
Note there are 3179 locations with a value of -1 indicating the number of locations where LiPe was the only species observed in that grid cell, while 5617 cells have a value of 1, these are cells where frogs other than LiPe were recorded and we will consider these zeros from here on. This is not enough zeros to use as a targeted background in Maxent (usually best to go with 10,000 background locations), but it is enough for most other methods that require zeros BRT, GLM etc.
<- visits_3or_more - LiPe_pres_raster2
Zero_LiPe freq(Zero_LiPe)
Reclassify so a zero raster has the value of 1 for all LiPe zeros, and NA or 0 for all other values:
<- c(-2, 0.1, 0, 0.1, 2, 1)
m <- matrix(m, ncol= 3, byrow=TRUE)
reclass2 <- reclassify(Zero_LiPe, reclass2)
rc2 <- mask(rc2,base)
Zero_LiPe2 freq(Zero_LiPe2)
Extract the cell numbers from the 0 grid where the value ==1. These are the lat/lons for locations where at least three surveys were done, but zero LiPe were detected - these are our pseudo absences.
<- Which(Zero_LiPe2 ==1,cells=TRUE)
cell_vals_0 <- xyFromCell(Zero_LiPe2,
xy_zero_LiPe cell=cell_vals_0)
plot(base) #plot the absence locations
points(xy_zero_LiPe, pch=20,cex=0.2)
plot(base) #plot the presence locations
points(LiPe[ , c("decimalLongitude",
"decimalLatitude")], pch=".")
We will use these pseudo absences for BRT and GLM later.
<- as.data.frame(xy_zero_LiPe)
LiPe_zeros colnames(LiPe_zeros) <- c("Long","Lat")
$spp <- "Limnodynastes_peroni"
LiPe_zeros$pres <- 0
LiPe_zeroswrite.csv(LiPe_zeros,"data/LiPe_zero_locations_3vis.csv")
writeRaster(Zero_LiPe2,"data/LiPe_absences1_3vis.asc",
overwrite=TRUE)
<- read.csv("data/LiPe_zero_locations_3vis.csv") zeros_3visit
Reclassify raster with number of surveys to 1 if more than 1 surveys was done, and zero otherwise:
<- c(0, 0.9, 0, 0.9, 3247, 1)
m <- matrix(m, ncol= 3, byrow= TRUE)
reclass3 <- reclassify(b1, reclass3)
rc3 freq(rc3)
It is a good idea to check your work as you go. Plotting data, looking at freq are some ways to do this. Here we double check that the number of cells with values greater than 1 in our b1 raster match the number of 1’s in the reclassification
<- as.data.frame(freq(b1))
t1 sum(t1$count)-t1[1,2]
Sure enough, freq of rc3 returns 29161 cells with a value of 1, and the sum of cell counts with values greater than 1 in the original b1 raster are also 29161.
Why am I showing this? Because my first attempt at generating the rc3 raster did not have equal numbers using a different method, and a typo in my first reclassification call kept these numbers separate. It is just a good idea to always verify that what you think you did in your code do not have any errors.
<- mask(rc3,base)
visits_1or_more freq(visits_1or_more)
Confirm that your two raters are the same extent, CRS, resolution etc:
compareRaster(LiPe_pres_raster2,
visits_1or_more,extent=TRUE,
rowcol=TRUE,
crs=TRUE,
res=TRUE,
orig=TRUE,
rotation=TRUE,
values=FALSE)
<- visits_1or_more - LiPe_pres_raster2
Zero_LiPe1visit freq(Zero_LiPe1visit)
Here we have 22134 grid cells where another frog(s) was recorded but not our target spp. We only need 10,000 cells (usually) for a targeted background in Maxent, so we can afford to do some spatial thinning. (If we don’t have close to 10,000 grid cells after spatial thinning, we will just use the kde bias layer).
<- Which(Zero_LiPe1visit==1,cells=TRUE)
cell_vals_0_1
<- as.data.frame(xyFromCell(Zero_LiPe2,
xy_zero_LiPe_1vis cell=cell_vals_0_1))
colnames(xy_zero_LiPe_1vis) <- c("Long","Lat")
$spp <- "Limnodynastes_peroni"
xy_zero_LiPe_1vis$pres <- 0
xy_zero_LiPe_1viswrite.csv(xy_zero_LiPe_1vis,"data/LiPe_zero_locations_1vis.csv")
<- SpatialPoints(coords=cbind(xy_zero_LiPe_1vis$Long,
LiPe_pts2 $Lat),
xy_zero_LiPe_1visCRS(as.character("+proj=longlat +datum=WGS84 +no_defs")))
<- as.data.frame(raster::extract(large_base,
cell_no2
LiPe_pts2,cellnumbers=TRUE))
head(cell_no2)
Notice here I did not need to use a subsequent colnames function.
<- as.data.frame(cbind(Long=xy_zero_LiPe_1vis$Long,
LiPe_cells2 Lat=xy_zero_LiPe_1vis$Lat,
cell_no=cell_no2$cells))
head(LiPe_cells2)
require(dplyr)
<- LiPe_cells2 %>%
LiPe_zeros_thinned_1vis group_by(cell_no) %>%
slice_sample(n=1)
length(LiPe_zeros_thinned_1vis$cell_no)
Note this in this object there is still more, than 10,000 areas where frog surveys were done, but no LiPe were recorded. We might see how many cells we would have if there were two or more visits, and then thin that result, but we will try to use these and the points from the bias layer to fit our models.
There are trade-offs with every modelling decision, some are important for your result, some do not really impact the result. Your decisions on what is best for your model depends on your question, the available data, the ecology of your species, and an understanding of what has worked or is important according to the literature for your species and your kind of question.
$spp <- "Limnodynastes_peroni"
LiPe_zeros_thinned_1vis$pres <- 0
LiPe_zeros_thinned_1vishead(LiPe_zeros_thinned_1vis)
write.csv(LiPe_zeros_thinned_1vis,
"data/LiPe_zero_locations_thinned_1vis.csv")
Note if you just use head(LiPe_zeros_thinned_1vis), you won’t see all the decimal points in Long and Lat, print.data.frame shows all the decimal
print.data.frame(head(LiPe_zeros_thinned_1vis))
It is often a good idea to break your data into randomly selected training and testing data. Often you would randomly remove 20% or more of your data and withold that from the model building process. Once your model was finsihed using your training data, you would then test your model with this training data.
If you have very little data, bootstrapping can be used to see if removal of a small percentage of data repeatedly changes results. This gives you a good understanding of the confidence intervals around your results and can reduce the impact of outliers on your final result. Cross-validation requires more data. Ideally for cross-validation you break your data into 10 folds (subsets) and compare results between folds, or summarise variability between folds. Some people select folds spatially, with each fold an independent spatial block of data, but random folds seem to work just as well or better.
If you can afford to set aside 20% + of your data, withholding a test data set is good practice. Ideally, you will test your model with completely independent data.
This is a soil wetness index from 2013 downloaded from the EcoCommons site.
Simply enter the above url, download the file, and load it from the directory path you saved it - ideally you would match the time the layer was made to the time your occurrence data was collected, and you might average over the last ten years, but this is just an example of the kinds of data you can use the EcoCommons point-and-click environment does this for you when in the SDM workflow you can select to make all the variables the finest or coarsest resolution.
<- raster("raw_data/awap_WRel1End.tif")
AWAP1 plot(AWAP1)
AWAP1<- raster("data/base_LiPe.asc")
base crs(base) <- "+proj=longlat +datum=WGS84 +no_defs"
Reduce the resolution of this layer on soil wetness, and match the extent of base. Note there are a variety of ways to this. In some cases these decisions on how to reduce or increase resolution matter to the result. Often the defaults are fine. Know your data, and dig deeper into the methods underlying these functions try typing ??raster::projectRaster
.
<- projectRaster(AWAP1,base)
AWAP2 <- mask(AWAP2,base)
AWAP3 plot(AWAP3)
<- AWAP3 AWAP
Now we want to be sure the name of the variable AWAP is the same in memory, and in a folder full of all predictors so we can predict our fitted model into the entire area possibly using other point occurrence data (independent data). Use writeRaster()
to save files locally:
writeRaster(AWAP,"predictors/AWAP.asc")
writeRaster(AWAP,"predictors_future/AWAP.asc")
Lets now upload the Vegetation classification layer NVIS from EcoCommons.
<- raster("raw_data/nvis-2020-90m_aus6_0e_mvg_amvg.tif") NVIS1
Each grid cell value should correspond to one of these categories.
freq(NVIS1)
freq(NVIS1,value=44)
Habitat of frog - it is hard to see one of these classes being super helpful, class 44 freshwater did not return anything. If we did use it we need to be careful changing the resolution because these are categorical variables (use nearest neighbor when changing resolution).
Random useful tidbit: A handy function to clip points to a polygon
<- spatial_point_file[polygon_file, ] clipped_points
to get a wetland file. The steps below might not work, alternatively load wet_cov.asc file from the scripts folder
These steps take too long to run here, so we supply the wetland layer, steps are below download an Australia wetland shapefile from here, select 1 0.1 degree grid that covers Australia, read in Wetland polygones
<- st_read("SurfaceHydrologyPolygonsNational.gdb") wetland
then turn polygons into raster. The argument getCover in the rasterize function calculates the area of each cell covered by wetland
<- rasterize(wetland,Point1_degree_grid,getCover=TRUE) wet_cov
for this variable it would have made more sense to source a polygon layer of freshwater only wetland. Here we are simply going to download the wet_cov.asc:
<- raster("https://drive.google.com/file/d/1mUWoc7f8nuSw4cmdwsxiWwt5gdXVohOb/view?usp=sharing") wet_cov
<- raster('raw_data/wet_cov.asc')
wetland1 crs(wetland1) <- "+proj=longlat +datum=WGS84 +no_defs"
<- projectRaster(wetland1,base)
wetland2 <- mask(wetland2,base)
wetland3 plot(wetland3)
Here we use the focal function (a neighborhood function) to transform each grid cell value to the sum of the central cell and all cells 5 cells away surrounding that central cell this is a surrogate of connectivity, isolated wetlands will have lower values.
<- focal(wetland3, w=matrix(1, nrow=11, ncol=11),fun=sum,na.rm=TRUE)
wetland_connectivity plot(wetland_connectivity)
Write rasters if needed:
writeRaster(wetland_connectivity,
"predictors/wetland_connectivity.asc")
writeRaster(wetland_connectivity,
"predictors_future/wetland_connectivity.asc")
Data comes in many formats, and uploading often requires different mehtods NetSDF files are increasingly common, below is a sript to bring in that file type
First we load days of maximum temperature data with that subset of data having the varname=“tmax”
require(sf)
require(ncdf4)
require(rasterVis)
require(raster)
<- brick("Terraclim_EY_NSW.nc",
MXtemp varname="tmax") #this NetCDF file includes many
<- mean(MXtemp) #if we just want one layer which is the mean of those daily totals MXtemp_mean
Days of maximum temperature data with that subset of data having the varname="tmax"
.
We will now download a vegetation greenness index from EcoCommons.
<- raster("https://api.data-ingester.app.ecocommons.org.au/api/data/34ab5ea6-650f-503f-9446-88d0ae9effe1/download/ndlc-2004-250m_trend-evi-mean.tif") EVI1
Process it as previous layers:
<- projectRaster(EVI1,base)
EVI2 <- mask(EVI2,base)
EVI3 plot(EVI3)
<- EVI3
EVI
writeRaster(EVI,"predictors/EVI.asc")
writeRaster(EVI,"predictors_future/EVI.asc")
Here we just show how to bring in current and future data from EcoCommons after looking at correlations between bioclim variables this one was dropped.
<- raster("https://api.data-ingester.app.ecocommons.org.au/api/data/90317596-ddef-5666-91c5-9cbc25c24fbc/download/current_1976-2005_bioclim-01.tif")
Bioclim01_1 <- projectRaster(Bioclim01_1,base)
Bioclim01_2 <- mask(Bioclim01_2 ,base)
Bioclim01_3 plot(Bioclim01_3)
<- Bioclim01_3
Bioclim01 writeRaster(Bioclim01,"predictors/Bioclim01.asc")
We will not write the bioclim data to the predictions_future folder because we have separate future climate predictions we can use. We do not have future predictions for EVI or Wetlands so we are assuming those things will stay the same in the future, the names of the future variables need to be the same as the current time variable names in order for the predict function to work, so just be sure to keep the different variables with the same name in different folders.
<- raster("https://api.data-ingester.app.ecocommons.org.au/api/data/cc081daa-f524-58c2-939e-166a2b2e79eb/download/current_1976-2005_bioclim-05.tif")
Bioclim05_1 <- projectRaster(Bioclim05_1, base)
Bioclim05_2 <- mask(Bioclim05_2, base)
Bioclim05_3 plot(Bioclim05_3)
<- Bioclim05_3
Bioclim05 writeRaster(Bioclim05,"predictors/Bioclim05.asc")
<- raster("https://api.data-ingester.app.ecocommons.org.au/api/data/476e4343-99f2-578e-b44a-951a55c6b7c2/download/current_1976-2005_bioclim-06.tif")
Bioclim06_1 <- projectRaster(Bioclim06_1,base)
Bioclim06_2 <- mask(Bioclim06_2,base)
Bioclim06_3 plot(Bioclim06_3)
<- Bioclim06_3
Bioclim06 writeRaster(Bioclim06,"predictors/Bioclim06.asc")
<- raster("https://api.data-ingester.app.ecocommons.org.au/api/data/b2d70413-6b74-5366-b5d5-82ed919ded93/download/current_1976-2005_bioclim-12.tif")
Bioclim12_1 <- projectRaster(Bioclim12_1,base)
Bioclim12_2 <- mask(Bioclim12_2,base)
Bioclim12_3 plot(Bioclim12_3)
<- Bioclim12_3
Bioclim12 writeRaster(Bioclim12,"predictors/Bioclim12.asc")
rm(Bioclim05_1,
Bioclim05_2,
Bioclim05_3,
Bioclim06_1,
Bioclim06_2,
Bioclim06_3,
Bioclim12_1,
Bioclim12_2, Bioclim12_3)
After looking at correlations between bioclim variables this one was dropped.
<- raster("https://api.data-ingester.app.ecocommons.org.au/api/data/e64dfa1d-ece8-50e7-9ed1-d603bfb7a101/download/current_1976-2005_bioclim-14.tif")
Bioclim14_1 <- projectRaster(Bioclim14_1,base)
Bioclim14_2 <- mask(Bioclim14_2 ,base)
Bioclim14_3 plot(Bioclim14_3)
<- Bioclim14_3
Bioclim14 writeRaster(Bioclim14,"predictors/Bioclim14.asc")
Create a raster stack of current data, be careful not to include the future variables below in this stack (they have the same names)
<- stack(AWAP,wetland_connectivity,EVI,Bioclim05,Bioclim06,Bioclim12)
preds_current plot(preds_current)
names(preds_current)
<- setNames(preds_current,c("AWAP","wetland_connectivity",
preds_current2 "EVI","Bioclim05","Bioclim06","Bioclim12"))
names(preds_current2)
rm(preds_current)
Repeat with future climate data, from EcoCommons Australia, Climate Projection, SRESA1B based on INM-CM30, 30 arcsec (~1km) - 2085 after looking at correlations between bioclim variables this one was dropped.
<- raster("https://api.data-ingester.app.ecocommons.org.au/api/data/4e5534eb-c743-52af-a9c3-d137df8ba8c1/download/SRESA1B_inm-cm30_2085_bioclim-01.tif")
Bioclim01_1 <- projectRaster(Bioclim01_1,base)
Bioclim01_2 <- mask(Bioclim01_2 ,base)
Bioclim01_3 plot(Bioclim01_3)
<- Bioclim01_3
Bioclim01 writeRaster(Bioclim01,"predictors_future/Bioclim01.asc")
<- raster("https://api.data-ingester.app.ecocommons.org.au/api/data/34fe63e3-4152-5406-96f7-dd67f54be5f5/download/SRESA1B_inm-cm30_2085_bioclim-05.tif")
Bioclim05_1 <- projectRaster(Bioclim05_1,base)
Bioclim05_2 <- mask(Bioclim05_2,base)
Bioclim05_3 plot(Bioclim05_3)
<- Bioclim05_3
Bioclim05 writeRaster(Bioclim05,"predictors_future/Bioclim05.asc")
<- raster("https://api.data-ingester.app.ecocommons.org.au/api/data/36cf3700-d238-511c-bb8b-bc3c3677309d/download/SRESA1B_inm-cm30_2085_bioclim-06.tif")
Bioclim06_1 <- projectRaster(Bioclim06_1,base)
Bioclim06_2 <- mask(Bioclim06_2,base)
Bioclim06_3 plot(Bioclim06_3)
<- Bioclim06_3
Bioclim06 writeRaster(Bioclim06,"predictors_future/Bioclim06.asc")
<- raster("https://api.data-ingester.app.ecocommons.org.au/api/data/449cee85-5f6a-5b25-aca8-0fd2838be09f/download/SRESA1B_inm-cm30_2085_bioclim-12.tif")
Bioclim12_1 <- projectRaster(Bioclim12_1,base)
Bioclim12_2 <- mask(Bioclim12_2,base)
Bioclim12_3 plot(Bioclim12_3)
<- Bioclim12_3
Bioclim12 writeRaster(Bioclim12,"predictors_future/Bioclim12.asc")
rm(Bioclim05_1,Bioclim05_2,Bioclim05_3,Bioclim06_1,Bioclim06_2,
Bioclim06_3,Bioclim12_1,Bioclim12_2,Bioclim12_3)
After looking at correlations between bioclim variables this one was dropped
<- raster("https://api.data-ingester.app.ecocommons.org.au/api/data/178ee1bc-0aa1-58a0-abaf-ff5e19fcf363/download/SRESA1B_inm-cm30_2085_bioclim-14.tif")
Bioclim14_1 <- projectRaster(Bioclim14_1,base)
Bioclim14_2 <- mask(Bioclim14_2 ,base)
Bioclim14_3 plot(Bioclim14_3)
<- Bioclim14_3
Bioclim14 writeRaster(Bioclim14,"predictors_future/Bioclim14.asc")
Again just be sure to run this in order so you are not getting current bioclim data mixed with the future data which has the same name
<- stack(AWAP,
preds_future
wetland_connectivity,
EVI,
Bioclim05,
Bioclim06,
Bioclim12)names(preds_future)
<- setNames(preds_current,
preds_future2 c("AWAP",
"wetland_connectivity",
"EVI",
"Bioclim05",
"Bioclim06",
"Bioclim12"))
names(preds_future2)
preds_current2rm(preds_future)
Here we will show you how to take a mean of many months of data if you have many months of data in one folder, perhaps you are exploring what a suitable niche looks like during a drought. You may then want climate like averages like precipitation, or average temperature during the drought months.
One way to do this if you have separate rasters fir each month in one folder.
<- list.files(pattern='.tif$', all.files=TRUE) #if there are 30 files of tifs from January, this creates a list of those files
Raster_list <- stack(raster_list) #reads files in, makes them into a raster stack
all_chl <- mean(all_chl) mean_chl
This produces one raster where each cell value is the mean value of all the rasters that were in the that folder
NETCDF files also are a common way for large volumes of data to be this is an example loop using Terraclimate data read NetCDF data file obtained from “TerraClimate” import netCDF file - data from TerraClimate python download. Download the file here.
require(sf)
require(ncdf4)
require(rasterVis)
require(raster)
setwd("~/Documents/Use_cases/EY_frogs/data") #directory where this NETCDF file is stored
<- brick("Terraclim_EY_NSW.nc", varname="tmax")
MXtemp <- mean(MXtemp)
Mxtemp_mean plot(Mxtemp_mean)
writeRaster(Mxtemp_mean,"MXtemp_TERRA_Sydney_region.asc",
overwrite=TRUE)
<- brick("Terraclim_EY_NSW.nc", varname="ppt")
Rain <- mean(Rain)
Rain_mean plot(Rain_mean)
writeRaster(Rain_mean,"Rain_TERRA_Sydney_region.asc",
overwrite=TRUE)
<- brick("Terraclim_EY_NSW.nc", varname="tmin")
MNtemp <- mean(MNtemp)
MNtemp_mean plot(MNtemp_mean)
writeRaster(MNtemp_mean,"MNtemp_TERRA_Sydney_region.asc",
overwrite=TRUE)
<- brick("Terraclim_EY_NSW.nc", varname="soil")
Soil <- mean(Soil)
Soil_mean plot(Soil_mean)
writeRaster(Soil_mean,"Soil_wetness_TERRA_Sydney_region.asc",
overwrite=TRUE)
crs(Soil_mean)
Fit a Maxent model and then fit a BRT model. Java will need to be installed in your local machine and the Maxent files will need to be in your local directory. Download Maxent from here. Ensure these files are in your local directory: maxent.jar maxent.bat maxent.sh
::p_load(dismo, gbm, raster, sf, rJava, boot, jpeg, MASSExtra, mgcv) pacman
Read occurrence data which was thinned for LiPe, and check if Maxent is in the folder
<- read.csv("data/LiPe_thinned.csv", stringsAsFactors=FALSE)
LiPe_thinned head(LiPe_thinned)
<- paste(system.file(package="dismo"),
jar "/java/maxent.jar", sep='')
if(file.exists(jar)) { cat("can continue, maxent is available")}
else{cat('cannot run this because maxent is not available')}
setwd("~/Documents/Use_cases/SDM_in_R/predictors")
<- list.files(pattern='.asc$', all.files=TRUE)
rast_lst
rast_lst<- stack(rast_lst)
LiPe_predictors crs(LiPe_predictors) <- "+proj=longlat +datum=WGS84 +no_defs"
LiPe_predictors
Earlier runs of this function indicated BioClim01 and BioClim14 were highly correlated with other variables, so they were dropped. With this set of variables we still see high correlations between BioClim 12, AWAP and BioClim 05. However, all correlation are below 0.80, and for the purposes of prediction these correlations are not too high for looking at comparisons of the importance of variables these correlations are still probably a bit high.
<- function(x, y, digits=2, prefix="", cex.cor, ...)
panel.cor
{<- par("usr"); on.exit(par(usr))
usr par(usr=c(0, 1, 0, 1))
<- abs(cor(x, y, use="pairwise.complete.obs"))
r <- format(c(r, 0.123456789), digits=digits)[1]
txt <- paste(prefix, txt, sep="")
txt if(missing(cex.cor)) cex.cor <- 0.8/strwidth(txt)
text(0.5, 0.5, txt, cex=cex.cor * r)
}<- randomPoints(LiPe_predictors,1000)
rpoints <- extract(LiPe_predictors,rpoints)
samp pairs(samp,lower.panel=panel.smooth,upper.panel=panel.cor)
<- LiPe_thinned$decimalLongitude
x_LiPe <- LiPe_thinned$decimalLatitude
y_LiPe <- cbind(x_LiPe,y_LiPe)
xy_LiPe <- SpatialPoints(xy_LiPe)
xy.sp_LiPe crs(xy.sp_LiPe) <- crs(LiPe_predictors)
crs(xy.sp_LiPe)
plot(LiPe_predictors[[1]])
points(xy.sp_LiPe,pch=20,cex=0.2)
<- LiPe_predictors[[2]]
mask_r <- raster::extent(LiPe_predictors) ext
Might not need this step
setwd("~/Documents/Use_cases/Marine/Yellowtail_R")
<- paste0(getwd(), "/output")
output_file if (!file.exists(output_file)) {
dir.create(output_file)}
<- c('removeduplicates=TRUE',
maxent_args 'jackknife=TRUE','replicates=3',
'replicatetype=crossvalidate',
'replicatetype=crossvalidate',
'betamultiplier=1',
'responsecurves=TRUE',
'pictures=TRUE','plots=TRUE',
'defaultprevalence=0.5',
'outputformat=raw')
<- c('removeduplicates=TRUE',
maxent_args 'jackknife=TRUE',
'responsecurves=TRUE',
'outputformat=cloglog',
'plots=TRUE',
'outputgrids=TRUE',
'applythresholdrule=Maximum training sensitivity plus specificity')
We then choose which arguments we want to use, there are many scientific publications that look at which arguments to use when a good place to start to understand what Maxent is doing can be found here, and this is an excellent overview of the process and the things to consider.
How you fit your model depends on your question and your data: 1. Remove duplicates is good to make sure is true, just in case, but our thinning step should have removed all duplicates. 2. Jackknife gives an indication of how much each variable contributes to the results. 3. Cloglog format is usually what gives optimal predicted values * a. outputformat=cloglog b. * plots=TRUE gives a variety of plots in the output (results1) folder
A complete list of args that can be used can be found below. This is an argument that is explored alot in the literature *‘betamultiplier=1’, multiply all automatic regularization parameters by this number. A higher number gives a more spread-out distribution.
<- c('removeduplicates=TRUE',
maxent_args 'jackknife=TRUE',
'responsecurves=TRUE',
'plots=TRUE')
This is the model fitting function, and path can be tricky, but when set correctly you get many results written to this folder
<- dismo::maxent(LiPe_predictors,
Mod1_LiPe
xy.sp_LiPe, path="results",
args= maxent_args)
If you open up the results 1 folder and click on “maxent.html” - a webpage should open up showing some results.
The ROC curve and AUC score < 0.9 suggests we are not predicting our training super accurately - if the goal of this exercise is prediction I would probably stop here and make sure My occurrence data is accurate, and think about what other variables might help us predict the presence of this species. Also, I would run the model multiple times.
Machine / stochatic learning algorithms often result in different results each time. It is a good idea to run the model multiple times to get a sense of the variation. We might get better predictions with a wetland layer that only included freshwater wetland, perhaps averaging AWAP soil wetness layers over a decade would help? Perhaps there is a national layer that captures small wetlands better?
Jackknife results, and variable importance do not indicate in my results that EVI is a good predictor, so we might look to drop that one from future models. Note those Jackknife results and variable response plots are also in the plots folder in the results folder. Try deleting the results in results1 folder, and running the model again, Are results exactly the same? WHY NOT?
Now lets predict a gridcell value for each cell in our extent
<- predict(Mod1_LiPe, LiPe_predictors,args=maxent_args)
map_predictions plot(map_predictions)
points(xy.sp_LiPe,pch=20,cex=0.2)
If you want to save a jpg of the mapped predictions use the code below
require("jpeg")
setwd("~/Documents/Use_cases/SDM_in_R") #check your working direction with getwd()
jpeg(paste0(getwd(),"/results/LiPe_predicted.jpeg"))
plot(map_predictions)
points(xy.sp_LiPe,pch=20,cex=0.2)
dev.off()
This threshold maximises the cases where the model incorrectly assigns unsuitable habitat (true negative) and misses suitable habitat (false positive). Keep in mind how you set your threshold will depend on your use of the resulting map, or your question.
<- read.csv("results/maxentResults.csv")
results1 <- results1$Maximum.training.sensitivity.plus.specificity.area
thresh
thresh
<- c(0, thresh, 0, thresh, 1, 1)
m <- matrix(m, ncol= 3, byrow= TRUE)
reclass <- reclassify(map_predictions, reclass)
rc
plot(rc)
points(xy.sp_LiPe,pch=20,cex=0.2)
What if someone gave us $1million to buy some LiPe habitat? We might want to be really sure the species is present, if that is the case we might want to find a model with AUC > 0.9, but just to illustrate the result lets try a much higher threshold.
First we want to rescale the data so values run between 0 and 1, then select a threshold of 0.8. There are many reasons and ways to set thresholds, this is just one example of selecting a more conservative threshold that identifies locations where we are relatively more sure are suitable.
<- cellStats(map_predictions,"min")
min.R <- cellStats(map_predictions,"max")
max.R
<- ((map_predictions - min.R) /
map_predictions2 - min.R) - 0 ) * 1
(max.R <- 0.8
thresh2
<- c(0, thresh2, 0, thresh2, 1, 1)
m2 <- matrix(m2, ncol= 3, byrow= TRUE)
reclass2 <- reclassify(map_predictions2, reclass2)
rc2
plot(rc2)
There are far fewer locations where we are pretty sure someone will find LiPe, and what if our original model is not that good, how can we assess this? Now we are going to explore how consistent results are with a three fold-cross validation:
When using cross validation as the replicate type: - * replicatetype=crossvalidate - * replicates=3
<- c('removeduplicates=TRUE',
maxent_args2 'jackknife=TRUE',
'responsecurves=TRUE',
'outputformat=cloglog',
'plots=TRUE',
'replicatetype=crossvalidate',
'replicates=3')
This is the model fitting function, and path can be tricky, but when set correctly you get many results written to this folder.
<- dismo::maxent(LiPe_predictors, xy.sp_LiPe, path="results2", args= maxent_args2) Mod2_LiPe
Notice that in the folder results2 the differences between the three models (maxent_0.html, maxent_2.html, maxent_2.html) are very similar (likely due to large numer of occurrence records). What other way could you compare the performance of these three models, and how?
Now let briefly explore how to account for sampling bias:
<- raster("data/Bias_LiPe_kd.asc")
bias_raster crs(bias_raster) <- "+proj=longlat +datum=WGS84 +no_defs" #always double check you have defined the CRS correctly
plot(bias_raster) #and a good idea to double check that it looks like it should
<- xyFromCell(bias_raster, sample(which(!is.na(values(bias_raster))),
bg 10000,
prob=values(bias_raster)[!is.na(values(bias_raster))]))
=c('biasfile=bias_raster') args
Some documentation indicates that you could use this argument to allow Maxent to correct for bias (looks like a different method though) I could not get this argument to work though.
<- c('removeduplicates=TRUE',
maxent_args 'jackknife=TRUE',
'responsecurves=TRUE',
'outputformat=cloglog',
'plots=TRUE')
<- dismo::maxent(x=LiPe_predictors,
Mod3_LiPe p=xy.sp_LiPe,a=bg,
path="results3",
args= maxent_args)
<- predict(Mod3_LiPe, LiPe_predictors)
map_predictions3
plot(map_predictions3)
points(xy.sp_LiPe,pch=20,cex=0.2)
Compare these results, what else might we do to compare results between models? There are some pretty big changes, and the model fitting does not give as high an AUC, Why? What if we had not used thinned data? Would the AUC have been higher? Why? Now lets try using targeted background points:
LiPe_zeros_thinned_1vis
LiPe_thinned<- read.csv("data/LiPe_zero_locations_thinned_1vis.csv") zeros_1visit
These are the locations where we assume there were no LiPe in that cell because at least one.
<- zeros_1visit$Long
x_bk2 <- zeros_1visit$Lat
y_bk2 <- cbind(x_bk2,y_bk2)
xy_bk2 <- SpatialPoints(xy_bk2)
xy.sp_bk2 crs(xy.sp_bk2) <- crs(LiPe_predictors)
crs(xy.sp_bk2)
plot(LiPe_predictors[[2]])
points(xy.sp_bk2,pch=20,cex=0.2)
Notice there has been alot of inland locations where frogs were surveyed for, but which did not record LiPe.
<- c('removeduplicates=TRUE',
maxent_args 'jackknife=TRUE',
'responsecurves=TRUE',
'outputformat=cloglog',
'plots=TRUE')
<- dismo::maxent(x=LiPe_predictors,
Mod4_LiPe p=xy.sp_LiPe,
a=xy.sp_bk2,
path="results4",
args= maxent_args)
<- predict(Mod4_LiPe, LiPe_predictors)
map_predictions4
plot(map_predictions4)
points(xy.sp_LiPe,pch=20,cex=0.2)
Notice our AUC values have fallen again and while the response curves look similar, variable importance has shifted. There is a clear difference between the overall background, and a targeted background that only uses environmental variables where someone has done a frog survey should we consider excluding desert areas, where LiPe is not observed from our study area, or extent? Will we get more realistic estimates of model performance, and of the subtle differences between the areas LiPe occurs and where is does not if we focus our extent to only occur areas where they might occur.
The map does not look radically different to the map predictions from model 1, but the low AUC is highligting how the it is hard to distinguish between presence locations and nearby locations where LiPe was not observed. This is due to the high spatial autocorrelation in our predictor values and in our sampling effort. This is a better AUC, and indicates there is much work to do to predict well.
What if we predict our current model into the future?
setwd("~/Documents/Use_cases/SDM_in_R/predictors_future")
<- list.files(pattern='.asc$', all.files=TRUE)
rast_lst2
rast_lst2<- stack(rast_lst2)
LiPe_predictors_future crs(LiPe_predictors_future) <- "+proj=longlat +datum=WGS84 +no_defs"
LiPe_predictors_future
<- predict(Mod4_LiPe, LiPe_predictors_future)
map_predictions5
plot(map_predictions5)
points(xy.sp_LiPe,pch=20,cex=0.2)
Any differences expected in the future map (slight difference with retraction away from inland and northern areas, possible suitability increase at higher elevations.
How confident are we? Was it a problem that we used one source of climate model to represent current climate and another source to represent future climate?
Now we will fit a Boosted Regression Tree (BRT) model, which is essentially a generalized boosting model (gbm).
First, lets add our data where no LiPe were found on three visits:
<- read.csv("data/LiPe_zero_locations_3vis.csv")
LiPe_zero_3vis head(LiPe_zero_3vis)
$X <- NULL
LiPe_zero_3vis$spp <- NULL
LiPe_zero_3vishead(LiPe_zero_3vis)
<- as.data.frame(cbind(Long=LiPe_thinned$decimalLongitude, Lat=LiPe_thinned$decimalLatitude, pres=LiPe_thinned$layer))
LiPe_thin <- LiPe_thin[complete.cases(LiPe_thin), ]
LiPe_thin2 head(LiPe_thin2)
<- rbind(LiPe_zero_3vis,LiPe_thin2)
LiPe_p_a
<- LiPe_p_a$Long
x_pa <- LiPe_p_a$Lat
y_pa <- cbind(x_pa,y_pa)
xy_pa <- SpatialPoints(xy_pa)
xy.sp_pa crs(xy.sp_pa) <- crs(LiPe_predictors)
crs(xy.sp_pa)
<- extract(LiPe_predictors,xy.sp_pa)
LiPe_preds <- cbind(LiPe_p_a,LiPe_preds)
LiPe_preds2 <- as.data.frame(LiPe_preds2[complete.cases(LiPe_preds2), ])
LiPe_preds3 summary(LiPe_preds3)
names(LiPe_preds3)
Note in gbm.step
you specify the column index location, so pres is indexed as 3 in names.
These models use stochastic learning, so you can expect a different answer each time unless you set the random seed, or make them deterministic by setting bag fraction to 1.
<- gbm.step(data=LiPe_preds3,
LiPe_mod5_brt gbm.x=c(4:9),
gbm.y=3,
family="bernoulli",
tree.complexity=3,
learning.rate=0.01,
bag.fraction=0.75)
Notice the plot that is produced automatically that shows the drop in holdout deviance as the number of trees increases, we are looking to explain more deviance with each tree.
Ideally this curve would not decrease quickly at the start, there would be gentler curve. Steep drops early in the model are often less stable models. Reducing the learning rate from 0.01 to 0.003 might lessen the drop, but this one is not too bad, and it would take much longer to run if we slow the learning rate down.
Elith et al. provides general guidelines on where to start, and suggests a variety of ways to select optimal settings. Keep in mind a tree complexity of 3 suggests you suspect that three way interactions make sense ecologically. Perhaps the suitability of frog ponds relates to how hot it has been, what the soil moisture was like and how much rain occurred?
Also, simply running this model again, and comparing results will give you some indication of how stable your model is there are alot of interesting things stored in the model object, and you can browse through them.
This shows all the other default settings as well as those specified in the call:
names(LiPe_mod5_brt)
$gbm.call LiPe_mod5_brt
Lets predict and plot results, this step takes a couple hours. This predict function will use the n.trees=to the best trees number found in the original model check it using:
<- predict(LiPe_predictors,LiPe_mod5_brt,type="response")
pred_map_brt $gbm.call$best.trees LiPe_mod5_brt
3000 trees will be used, you can choose another value if you have a good reason too.
require("jpeg")
setwd("~/Documents/Use_cases/SDM_in_R")
jpeg(paste0(getwd(),"/results_brt/LiPe_brt_predicted.jpeg"))
plot(pred_map_brt)
points(xy.sp_LiPe,pch=20,cex=0.2)
dev.off()
writeRaster(pred_map_brt, "results_brt/LiPe_brt_predicted.asc")
There are alot more options to explore with BRT curious what a function does (i.e., run ??gbm.plot
).
jpeg(paste0(getwd(),"/results_brt/LiPe_brt_var_response_curves.jpeg"))
gbm.plot(LiPe_mod5_brt,rug=TRUE) #generates response curves
dev.off()
gbm.plot.fits(LiPe_mod5_brt)
gbm.interactions(LiPe_mod5_brt)
<- as.data.frame(LiPe_mod5_brt$contributions)
BRT_var_import write.csv(BRT_var_import,"results_brt/LiPe_brt_var_importance.csv")
var importance
suggests two variables don’t add much, lets look at possibly dropping those two:
<- gbm.simplify(LiPe_mod5_brt, n.drops=4) brt_drops
If we look at the graph produced by gbm.simplify we can see a red verticle line at 2 variables removed
gbm uses boosting to focus on improving fit (deviance) on each iteration, but if it is focussing on a variable that does not help, it actually increases deviance by having that extra predictor in the model.
A better predictive model might result from dropping 1 or 2 variables.
names(LiPe_preds3)[brt_drops$pred.list[[1]]]
Compare that to the full list used in the model:
names(LiPe_preds3)[c(4:9)]
Not surprisingly this indicates that the variable EVI which contributes least can be dropped, Ecologically it might be a stretch to think that wetlands for frogs will be related to how green vegetation is in an area, so it might make sense to drop it.
Note few people worry about this step, but ecologists like simpler models, and keeping EVI actually increases predictive deviance, so dropping makes sense.
names(LiPe_preds3)[brt_drops$pred.list[[2]]]
If we were to drop two variables it would be EVI and wetland connectivity. We know wetland connectivity includes salty wetlands which are no good for frogs and the layer does not include most of the small wetlands used by frogs, so ecologically might make sense to drop it, and a slight increase in predictive deviance when included. Given my objective is to maximise predictions, I would drop both.
A short cut for running another model with those two variables dropped is below, notice the diffent way to call gbm.x
.
<- gbm.step(data=LiPe_preds3,
LiPe_mod6_brt gbm.x=brt_drops$pred.list[[2]],
gbm.y=3, family="bernoulli",
tree.complexity=3,
learning.rate=0.01,
bag.fraction=0.75)
Other plotting methods, evaluation, bootstrapping or ways to optimise the selection of the number of trees can be found here.
Elith J, Leathwick J R, & Hastie T (2008). Boosted regression trees - a new technique for modelling ecological data. Journal of Animal Ecology.
Now we will very quickly fit a GLM from the package MASSExtra
:
<- glm(pres ~ AWAP + Bioclim05 + Bioclim06 + Bioclim12 + EVI+ wetland_connectivity, family=binomial, data=LiPe_preds3)
glm1
glm1summary(glm1)
This is a great new function which simplifies variable selection (remember the days of forward, backward) stepAIC is often good, stepBIC uses a different metric.
<- step_BIC(glm1) glm1a
This suggests our best model includes AWAP, Bioclim 05, and Bioclim06.
Check residual plots, interpretation of residual plots for binomial family is different to other residual plot diagnostics, hear there appears to be little violation of linearity, and there are some influencial data outliers from library(boot)
.
<- glm.diag(glm1a) #residual diagnostics
model_diags glm.diag.plots(glm1a,model_diags)
Lets see how the predictions look:
names(LiPe_predictors)
<- LiPe_predictors[[which(c(TRUE,TRUE,
LiPe_preds_subset TRUE,FALSE,
FALSE,FALSE))]]
<- predict(LiPe_preds_subset,glm1a,type="response")
glm_pred setwd("~/Documents/Use_cases/SDM_in_R")
Mke sure your current directory is your working directory, use getwd()
to ensure you have asigned the directory correctly.
jpeg(paste0(getwd(),"/results_glm/LiPe_glm_predicted.jpeg"))
plot(glm_pred)
points(xy.sp_LiPe,pch=20,cex=0.2)
dev.off()
Finally we will git a GAM from the mgcv
package. A cr spline
is defined by having knots spread throughout the co-variate values and which the amount of smooting is optimised, so it can become less wiggly.
<- gam(pres ~ s(AWAP, bs="cr") +
gam1 s(Bioclim05, bs="cr") +
s(Bioclim06,bs="cr"),
family=binomial, data=LiPe_preds3)
gam.check(gam1)
If we compare AIC values it looks like the GAM picks up some important non-linear relationships and is better than a GLM for this variable set anyway. BIC also suggests gam1 is better than either glm model:
AIC(glm1a)
AIC(gam1)
plot(gam1)
<- predict(LiPe_preds_subset,gam1,type="response")
gam_pred setwd("~/Documents/Use_cases/SDM_in_R")
Make sure your current directory is your working directory. Generate a jpeg map of the GAM predictions:
jpeg(paste0(getwd(),"/results_gam/LiPe_glm_predicted.jpeg"))
plot(gam_pred)
points(xy.sp_LiPe,pch=20,cex=0.2)
dev.off()
List of Maxent args (begin call with maxent_args <- c(
):
removeduplicates=TRUE
, remove duplicate presence records. If environmental data are in grids, duplicates are records in the same grid cell, otherwise, duplicates are records with identical coordinates.
maximumbackground=10000
, if the number of background points/grid cells is larger than this number, then this number of cells is chosen randomly for background points.maximumbackground=10000
, if the number of background points/grid cells is larger than this number, then this number of cells is chosen randomly for background points.addsamplestobackground=TRUE
, add to the background any sample for which has a combination of environmental values that isn`t already present in the backgroundallowpartialdata=FALSE
, during model training, allow use of samples that have nodata values for one or more environmental variablesjackknife=TRUE
, NB: default=FALSE; measure importance of each environmental variable by training with each environmental variable first omitted, then used in isolation.randomseed=FALSE
, if selected, a different random seed will be used for each run, so a different random test/train partition will be made and a different random subset of the background will be used, if applicable.defaultprevalence=0.5
, default prevalence of the species: probability of presence at ordinary occurrence points. See Elith et al. Diversity and Distributions, 2011 for details.randomtestpoints=0
, percentage of presence localities to be randomly set aside as test points, used to compute AUC, omission, etc.replicates=1
, number of replicate runs to do when cross-validating, bootstrapping or doing sampling with replacement runs. replicatetype=crossvalidate
, if replicates > 1, do multiple runs of this type:maximumiterations=500
, stop training after this many iterations of the optimization algorithmconvergencethreshold=0.00001
, stop training when the drop in log loss per iteration drops below this numberautofeature=TRUE
, automatically select which feature classes to use, based on number of training sampleslinear=TRUE
, allow linear features to be usedquadratic=TRUE
, allow quadratic features to be usedproduct=TRUE
, allow product features to be usedthreshold=TRUE
, allow threshold features to be usedhinge=TRUE
, allow hinge features to be usedlq2lqptthreshold=80
, number of samples at which product and threshold features start being usedl2lqthreshold=10
, number of samples at which quadratic features start being usedhingethreshold=15
, number of samples at which hinge features start being usedbetamultiplier=1
, multiply all automatic regularization parameters by this number. A higher number gives a more spread-out distribution.beta_threshold=-1
, regularization parameter to be applied to all threshold features; negative value enables automatic settingbeta_categorical=-1
, regularization parameter to be applied to all categorical features; negative value enables automatic settingbeta_lqp=-1
, regularization parameter to be applied to all linear, quadratic and product features; negative value enables automatic settingbeta_hinge=-1
, regularization parameter to be applied to all hinge features; negative value enables automatic settingresponsecurves=TRUE
, NB. default=FALSE; create graphs showing how predicted relative probability of occurrence depends on the value of each environmental variableresponsecurvesexponent=FALSE
, instead of showing the logistic value for the y axis in response curves, show the exponent (a linear combination of features).pictures=TRUE
, create a .png image for each output gridoutputformat=raw
, representation of probabilities used in writing output grids, see Help for detailswriteclampgrid=TRUE
, write a grid that shows the spatial distribution of clamping. At each point, the value is the absolute difference between prediction values with and without clamping.writemess=TRUE
, a multidimensional environmental similarity surface (MESS) shows where novel climate conditions exist in the projection layers. The analysis shows both the degree of novelness and the variable that is most out of range at each point.writeplotdata=FALSE
, write output files containing the data used to make response curves, for import into external plotting software.
outputgrids=TRUE
, write output grids. Turning this off when doing replicate runs causes only the summary grids (average, std, deviation, etc) to be written, not those for the individual runs.plots=TRUE
, write various plots for inclusion in .html outputlogfile=maxent.log
, file name to be used for writing debugging information about a run in output directoryapplythresholdrule=Fixed cumulative value 1
, apply a threshold rule, generating a binary outputgrid in addition to the regular prediction grid. Use the full name of the threshold rule in Maxent’s html output as the argument. For example, applyThresholdRule=Fixed
cumulative value 1.logscale=TRUE
, if selected, all pictures of models will use a logarithmic scale for color-codingwritebackgroundpredictions=FALSE
, write .csv file with predictions at background pointsfadebyclamping=FALSE
, reduce prediction at each point in projections by the difference between clamped and non-clamped output at that pointextrapolate=TRUE
, predict to regions of environmental space outside the limits encountered during trainingdoclamp=TRUE
apply clamping when projecting::p_load(raster, rgdal, caret, dismo, gbm, raster, sp, jpeg, galah, tidyr) pacman
There are many ways to evaluate your model. With statistical models you can evaluate residual plots to check that your model meets assumptions.
It is often a good idea to break your data into randomly selected training and testing data. Often you would randomly remove 20% or more of your data and withold that from the model building process. Once your model was finsihed using your training data, you would then test your model with this training data. If you have very little data, bootstrapping can be used to see if removal of a small percentage of data repeatedly changes results. This gives you a good understanding of the confidence intervals around your results and can reduce the impact of outliers on your final result.
Cross-validation requires more data. Ideally for cross-validation you break your data into 10 folds (subsets) and compare results between folds, or summarise variability between folds. If you can afford to set aside 20% + of your data, withholding a test dataset is good practice. Ideally, you will test your model with completely independent data.
Here we evaluate the BRT model with the training data used to construct the model. First we run the BRT model again without additional variables identified in our simplify function.
<- gbm.step(data=LiPe_preds3, gbm.x=c(4:7), gbm.y=3, family="bernoulli", tree.complexity=3, learning.rate=0.01, bag.fraction=0.75) LiPe_mod_new
<- LiPe_preds3[LiPe_preds3$pres==1,]
LiPe_1s <- LiPe_1s$Long
x_1s <- LiPe_1s$Lat
y_1s <- cbind(x_1s,y_1s)
xy_1s <- SpatialPoints(xy_1s)
xy.sp_1s crs(xy.sp_1s) <- crs(LiPe_predictors)
crs(xy.sp_1s)
<- LiPe_preds3[LiPe_preds3$pres==0,]
LiPe_0s <- LiPe_0s$Long
x_0s <- LiPe_0s$Lat
y_0s <- cbind(x_0s,y_0s)
xy_0s <- SpatialPoints(xy_0s)
xy.sp_0s crs(xy.sp_0s) <- crs(LiPe_predictors)
crs(xy.sp_0s)
Look at which predictors are in our predictors stack:
names(LiPe_predictors)
Here we drop the variables that were dropped from the LiPe_mod_new BRT model. This is so we can predict all locations using the subset of variables used in the model. In any model, the variables you use, need to match the variables you use to predict column names, or raster stack names need to match exactly.
<- LiPe_predictors[[which(c(TRUE,TRUE,
LiPe_preds_brt_new TRUE,TRUE,
FALSE,FALSE))]]
<- predict(LiPe_preds_brt_new,
brt_pred
LiPe_mod_new,type="link") #note we usually use type link, but the evaluate function is based on the "link" scale
<- dismo::evaluate(xy.sp_1s,
brt_eval_training
xy.sp_0s,
LiPe_mod_new, LiPe_preds_brt_new)
There are a variety of ways to calculate a threshold, maximising “kappa” is one well supported method, but there are many othersin Maxent we saw a table with a variety of ways to calculate thresholds.
<- threshold(brt_eval_training, stat="kappa") thresh_brt
Another approach to thresholding - there are many choose the one that best fits in your model
<- threshold(brt_eval_training, stat="spec_sens")
thresh_brt2
<- c(-5, thresh_brt, 0, thresh_brt, 2.1, 1)
m <- matrix(m, ncol= 3, byrow=TRUE)
reclass <- reclassify(brt_pred, reclass)
rc_brt
plot(rc_brt)
points(xy.sp_LiPe,pch=20,cex=0.2)
Look at training evaluation:
brt_eval_training
<- LiPe_preds3$Long
x_s <- LiPe_preds3$Lat
y_s <- cbind(x_s,y_s)
xy_s <- SpatialPoints(xy_s)
xy.sp_s crs(xy.sp_s) <- crs(LiPe_predictors)
<- raster::extract(rc_brt,xy.sp_s)
predicted
<- as.data.frame(cbind(predicted=predicted,
test1 reference=LiPe_preds3$pres))
$reference <- factor(test1$reference, levels=c("1","0"))
test1$predicted <- factor(test1$predicted, levels=c("1","0"))
test1
<- confusionMatrix(test1$predicted,test1$reference) conf_matrix
This gives us the confusion matrix:
$table
conf_matrix
<- conf_matrix$table[1,1]
TP <- conf_matrix$table[1,2]
FP <- conf_matrix$table[2,1]
FN <- conf_matrix$table[2,2]
TN
<- FP/(FP + TN)
FPR <- FN/(FN + TP)
FNR <- TP/(TP + FN)
TPR <- TN/(TN + FP)
TNR
<- TPR + TNR - 1 #if we were predicting perfectly our TSS would=1
TSS #we are far from perfect in this model, and a much lower number that AUC TSS
A perfect AUC value would also be 1. TSS often gives a better indication of prediction because it takes into account anbalanced errors between positives and negatives.
<- TP / (TP + FP)
Precision <- TP / (TP + FN)
Recall <- (2*Precision*Recall)/(Precision+Recall) #F1 is another statistic, now thought to be F1
Better than TSS:
<- TP /(TP + 0.5*(FP + FN)) #this is another way to calculate the same value F1
But the literature refers to Precision and Recall above (same thing though), again perfect score would be 1.
#as you can see in our case F1 falls between AUC and TSS scores F1
Now we are going to test our results from ALA records that were not from the FrogID project for the most part we are just repeating code from step 1.
galah_config(email="r.clemens@griffith.edu.au")
<- galah_call() %>%
LiPe_ALA galah_identify("Limnodynastes peroni") %>%
galah_filter(datasetName != "FrogID") %>%
galah_filter(coordinateUncertaintyInMeters < 200) %>%
galah_filter(year>1999)%>%
galah_filter(stateProvince == "Queensland") %>%
galah_select("datasetName","year") %>%
atlas_occurrences()
<- na.omit(LiPe_ALA)
LiPe_ALA
<- SpatialPoints(coords=cbind(LiPe_ALA$decimalLongitude, LiPe_ALA$decimalLatitude),CRS(as.character("+proj=longlat +datum=WGS84 +no_defs")) )
LiPe_pts_ALA
<- rep(1,length(LiPe_ALA$decimalLatitude)) LiPe_pres_ALA
Then create a raster with a value of 1 for each gridcell where a LiPe was recorded.
<- raster::rasterize(LiPe_pts_ALA,base,LiPe_pres_ALA, fun=min, background=0)
LiPe_pres_raster_ALA <- raster::projectRaster(LiPe_pres_raster_ALA,base)
LiPe_pres_raster2_ALA <- raster::mask(LiPe_pres_raster2_ALA,base)
LiPe_pres_raster2_ALA <- aggregate(base, fact=4)
large_base <- raster::extract(large_base,LiPe_pts_ALA,cellnumbers=TRUE)
cell_no_ALA <- cbind(LiPe_ALA,cell_no_ALA) LiPe_cells_ALA
require(dplyr)
<- LiPe_cells_ALA %>%
LiPe_thinned_ALA group_by(cells) %>%
slice_sample(n=1)
<- galah_call() %>%
frogs_ALA galah_identify("Amphibia") %>%
galah_filter(datasetName != "FrogID") %>%
galah_filter(coordinateUncertaintyInMeters < 100) %>%
galah_filter(year>1999) %>%
galah_filter(stateProvince == "Queensland") %>%
galah_select("datasetName","year") %>%
atlas_occurrences()
$unique_visit <- paste0(frogs_ALA$decimalLatitude,
frogs_ALA$decimalLongitude,
frogs_ALA$eventDate)
frogs_ALA$visitID <- as.numeric(as.factor(frogs_ALA$unique_visit)) frogs_ALA
<- frogs_ALA %>%
frogs_ALA2 group_by(decimalLatitude, decimalLongitude, visitID) %>%
summarise(no_spp=length(unique(scientificName)))
<- frogs_ALA2 %>%
frogs_ALA3 group_by(decimalLatitude, decimalLongitude) %>%
summarise(no_visits=length(visitID))
<- na.omit(frogs_ALA3)
frogs_ALA3
<- SpatialPoints(coords=cbind(frogs_ALA3$decimalLongitude,
visits_pts_ALA $decimalLatitude),
frogs_ALA3CRS(as.character("+proj=longlat +datum=WGS84 +no_defs")))
<- rasterize(visits_pts_ALA, base,
b1_ALA $no_visits,
frogs_ALA3fun=sum, background=0)
<- bbox(b1_ALA)
bb_ALA <- raster::extract(b1_ALA,
visit_locations_ALA
visits_pts_ALA, cellnumbers=TRUE)
<- as.data.frame(na.omit(visit_locations_ALA))
visit_locations2_ALA <- unique(visit_locations2_ALA$cells) cellID_ALA
<- raster::xyFromCell(b1_ALA,cell=cellID_ALA)
xy_visits_ALA
cellStats(b1_ALA ,"max")
<- c(0, 2.9, 0, 2.9, 880, 1)
m <- matrix(m, ncol= 3, byrow= TRUE)
reclass <- reclassify(b1_ALA, reclass)
rc_ALA
<- mask(rc_ALA,base)
visits_3or_more_ALA
<- visits_3or_more_ALA - LiPe_pres_raster2_ALA
Zero_LiPe_ALA
freq(Zero_LiPe_ALA)
<- c(-2, 0.1, 0, 0.1, 2, 1)
m <- matrix(m, ncol= 3, byrow= TRUE)
reclass2 <- reclassify(Zero_LiPe_ALA, reclass2)
rc2_ALA <- mask(rc2_ALA,base)
Zero_LiPe2_ALA freq(Zero_LiPe2_ALA)
Extract the cell numbers from the 0 grid where the value ==1:
<- Which(Zero_LiPe2_ALA ==1,cells=TRUE) cell_vals_0_ALA
These are the lat / longs for locations where at least three surveys were done, but zero LiPe were detected - these are our pseudo absences.
<- xyFromCell(Zero_LiPe2_ALA,cell=cell_vals_0_ALA) xy_zero_LiPe_ALA
Plot the absence locations:
plot(base)
points(xy_zero_LiPe_ALA, pch=20,cex=0.2)
<- as.data.frame(xy_zero_LiPe_ALA) xy_zero_LiPe_ALA
Now lets turn our presence and absence points into one dataset:
head(LiPe_thinned_ALA)
<- as.data.frame(cbind(Long=LiPe_thinned_ALA$decimalLongitude,
ALA_preds1 Lat=LiPe_thinned_ALA$decimalLatitude,
pres=1))
<- as.data.frame(cbind(Long=xy_zero_LiPe_ALA$x,
ALA_preds2 Lat=xy_zero_LiPe_ALA$y,
pres=0))
<- rbind(ALA_preds1,ALA_preds2)
ALA_preds
<- ALA_preds$Long
x_a <- ALA_preds$Lat
y_a <- cbind(x_a,y_a)
xy_a <- SpatialPoints(xy_a)
xy.sp_a crs(xy.sp_a) <- crs(LiPe_predictors)
Now we just extract the 1’s & 0’s using our lat longs from the new ALA data extracted from the same thresholded brt predictions from above.
<- raster::extract(rc_brt,xy.sp_a)
predicted_a
<- as.data.frame(cbind(predicted=predicted_a,
test2 reference=ALA_preds$pres))
$reference <- factor(test2$reference, levels=c("1","0"))
test2$predicted <- factor(test2$predicted, levels=c("1","0"))
test2
<- confusionMatrix(test2$predicted,
conf_matrix2 $reference)
test2
conf_matrix2
<- conf_matrix2$table[1,1]
TP2 <- conf_matrix2$table[1,2]
FP2 <- conf_matrix2$table[2,1]
FN2 <- conf_matrix2$table[2,2]
TN2
<- FP2/(FP2 + TN2)
FPR2 <- FN2/(FN2 + TP2)
FNR2 <- TP2/(TP2 + FN2)
TPR2 <- TN2/(TN2 + FP2)
TNR2
<- TPR2 + TNR2 - 1
TSS2 #notice this is a big drop from our original TSS statistic
TSS2
TSS
<- TP2 / (TP2 + FP2)
Precision2 <- TP2 / (TP2 + FN2)
Recall2 <- 2*((Precision2*Recall2)/(Precision2+Recall2))
F1_b #notice this is a very slight drop from our original F1 statsitic
F1_b F1
When we used data to test the model that was not associated with the FrogID project, TSS, accuracy, and precision scores all fell markedly. Encouragingly the F1 score did not fall that much, and as a user you will need to decidd on what levels of accuracy in your confusion matrix are good enough for the intended use of the model.
If I was looking to predict well in order to select what reserves to make for this frog, I would want better performance.
The first step I would take to improve this model, would be to select pseudo_absence locations at only those locations where similarly distributed frogs were seen but our target species was not seen on three visits. Our psuedo absence points are still including places in the arid interior where this species would never occur.
Second, as mentioned before, I would also look for better freshwater wetland layers, perhaps look at how often an area is wet over the last decade.
sessionInfo()