Biodiversity Research Workflows and Ecological Niche Modelling

Modern biodiversity research increasingly relies on computational workflows that integrate large datasets — species occurrence records, environmental layers, remotely sensed habitat variables — to address questions at landscape, continental, and global scales. Among these, Ecological Niche Modelling (ENM), also referred to as Species Distribution Modelling (SDM), is one of the most widely used analytical frameworks in contemporary ecology and conservation biology.

What Is Ecological Niche Modelling?

Ecological Niche Modelling is a family of statistical and machine-learning methods that relate known species occurrence records to environmental predictors — climate variables, elevation, soil properties, land cover — to estimate the spatial distribution of suitable habitat. The output is typically a continuous probability or suitability surface: a raster map where each pixel receives a score reflecting how closely its environmental conditions match those associated with known species occurrences.

The conceptual foundation draws on Hutchinson's n-dimensional hypervolume model of the ecological niche (1957): the set of all environmental conditions under which a species can maintain viable populations. ENM algorithms attempt to characterise this hypervolume from occurrence data and then project it across geographic space — identifying where the same or similar conditions exist.

Core ENM Algorithms

Several algorithmic approaches are routinely used, each with different statistical assumptions and input requirements:

  • MaxEnt (Maximum Entropy) — the most widely used ENM algorithm. It uses presence-only occurrence data and background environmental values to estimate habitat suitability. MaxEnt's output is an index of relative suitability (0–1), and it performs well even with small sample sizes when properly configured. The algorithm is implemented in the freely available MaxEnt software and the R package dismo.
  • Bioclim — a simple envelope model that identifies the range of environmental conditions present at occurrence localities and classifies unseen locations by their percentile within that range. Bioclim is computationally lightweight and easy to interpret but tends to overfit in complex environmental spaces.
  • GLM/GAM (Generalised Linear/Additive Models) — regression-based approaches that require both presence and absence data. They are more statistically explicit than presence-only methods and allow uncertainty estimation, but depend on the quality of absence (or pseudo-absence) records.
  • Random Forest and Boosted Regression Trees (BRT) — ensemble machine-learning methods that typically achieve high predictive accuracy and can handle complex non-linear relationships and variable interactions. They require larger training datasets and are less interpretable than simpler models.
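
As an illustration of the simplest of these approaches, the sketch below scores a site with a Bioclim-style envelope. It follows one common formulation (percentile per variable, folded around the median, minimum across variables); the function names, variable names, and sample values are hypothetical and not taken from any particular package.

```python
from bisect import bisect_left, bisect_right


def percentile_rank(sorted_values, x):
    """Fraction of training values below x, counting ties as half."""
    below = bisect_left(sorted_values, x)
    ties = bisect_right(sorted_values, x) - below
    return (below + 0.5 * ties) / len(sorted_values)


def bioclim_score(occurrence_env, site_env):
    """Envelope score in [0, 1] for one site.

    occurrence_env: list of dicts, one per occurrence record
    site_env: dict of the same variables for the site to score
    """
    scores = []
    for var, x in site_env.items():
        values = sorted(rec[var] for rec in occurrence_env)
        # Outside the observed range for any variable: unsuitable.
        if x < values[0] or x > values[-1]:
            return 0.0
        p = percentile_rank(values, x)
        # Fold the percentile so the median scores 1 and the extremes near 0.
        scores.append(2.0 * min(p, 1.0 - p))
    # The most limiting variable determines the final score.
    return min(scores)


# Illustrative values only (temperature in deg C, precipitation in mm).
occurrences = [
    {"temp": 10, "precip": 500},
    {"temp": 12, "precip": 600},
    {"temp": 14, "precip": 700},
    {"temp": 16, "precip": 800},
    {"temp": 18, "precip": 900},
]
central_site = bioclim_score(occurrences, {"temp": 14, "precip": 700})
outside_site = bioclim_score(occurrences, {"temp": 25, "precip": 700})
```

Because the score is driven by the single most limiting variable, every predictor added can only lower or preserve a site's score, which is one reason predictor selection matters for envelope models.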

In practice, researchers commonly use ensemble modelling, which combines predictions from multiple algorithms, to reduce model-specific uncertainty and produce more robust distribution estimates (the biomod2 R package is widely used for this purpose).
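
A minimal sketch of the ensemble idea, assuming each algorithm has already produced a suitability value per grid cell: it takes a weighted mean across algorithms and reports the between-model standard deviation as a simple uncertainty layer. The function name, the weighting scheme, and the sample values are assumptions for illustration; this is not how biomod2 or any specific package computes its ensembles.

```python
from statistics import pstdev


def ensemble(predictions, weights=None):
    """Combine per-algorithm suitability predictions cell by cell.

    predictions: dict mapping algorithm name -> list of suitability values
    weights: optional dict mapping algorithm name -> non-negative weight
             (for example an evaluation score); defaults to equal weights
    Returns (weighted_mean, between_model_sd), both lists of floats.
    """
    names = list(predictions)
    n_cells = len(predictions[names[0]])
    if any(len(predictions[name]) != n_cells for name in names):
        raise ValueError("all prediction lists must cover the same cells")
    if weights is None:
        weights = {name: 1.0 for name in names}
    total = sum(weights[name] for name in names)

    mean = [
        sum(weights[name] * predictions[name][i] for name in names) / total
        for i in range(n_cells)
    ]
    # Disagreement between algorithms, usable as a prediction-variance map.
    spread = [pstdev([predictions[name][i] for name in names]) for i in range(n_cells)]
    return mean, spread


# Illustrative values for two cells and two algorithms.
mean_map, sd_map = ensemble({"maxent": [0.8, 0.2], "random_forest": [0.6, 0.2]})
```

Cells where the algorithms disagree strongly (a high standard deviation) are the ones where the ensemble mean deserves the least trust.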

Input Data

Species occurrence records are typically sourced from GBIF (the Global Biodiversity Information Facility, gbif.org), which aggregates over 2.4 billion occurrence records from natural history collections, citizen science platforms, and research surveys. Data quality varies widely; a standard ENM workflow includes cleaning steps to remove duplicate records, geographic outliers, records without coordinate precision sufficient for the analysis resolution, and records based on cultivated or captive individuals.
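
The cleaning steps listed above might be sketched as follows. The record keys (`lat`, `lon`, `uncertainty_m`, `cultivated`), the default thresholds, and the decision to drop records with unknown coordinate uncertainty are all assumptions made for this example; real GBIF downloads use Darwin Core field names and need more checks, including species-specific outlier detection.

```python
def clean_occurrences(records, max_uncertainty_m=1000.0, decimals=4):
    """Return the records that pass basic quality filters.

    records: list of dicts with hypothetical keys
             'lat', 'lon', 'uncertainty_m', 'cultivated'
    max_uncertainty_m: largest coordinate uncertainty accepted, chosen to
                       match the analysis resolution (about 1 km here)
    decimals: coordinate rounding used to detect duplicates
    """
    seen = set()
    cleaned = []
    for rec in records:
        lat, lon = rec.get("lat"), rec.get("lon")
        # Drop records without usable coordinates.
        if lat is None or lon is None:
            continue
        if not (-90 <= lat <= 90 and -180 <= lon <= 180):
            continue
        # (0, 0) is a frequent placeholder for missing coordinates.
        if lat == 0 and lon == 0:
            continue
        # Drop records too imprecise for the analysis resolution.
        uncertainty = rec.get("uncertainty_m")
        if uncertainty is None or uncertainty > max_uncertainty_m:
            continue
        # Drop cultivated or captive individuals.
        if rec.get("cultivated", False):
            continue
        # Drop duplicates at the chosen coordinate precision.
        key = (round(lat, decimals), round(lon, decimals))
        if key in seen:
            continue
        seen.add(key)
        cleaned.append(rec)
    return cleaned
```

Geographic outliers, such as records far outside the known range, are not handled here because detecting them requires a species-specific reference such as an expert range map.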

Environmental predictor layers most commonly used are the WorldClim 2.1 and CHELSA bioclimatic variable sets at 30 arc-second (~1 km) resolution, which provide 19 bioclimatic variables derived from temperature and precipitation time series. For marine species, Bio-ORACLE provides comparable oceanographic layers. Topographic variables (elevation, slope, aspect, topographic wetness index) from digital elevation models are frequently included for terrestrial species with fine-scale habitat preferences.

Applications in Conservation and Research

ENM is applied across a wide range of ecological and conservation questions:

  • Climate change projections — projecting current habitat suitability models onto future climate scenarios (CMIP6 model outputs) to identify which species face range contraction, expansion, or shift, and which geographic areas will gain or lose viable habitat.
  • Invasive species risk assessment — using ENM to identify regions climatically suitable for an invasive species before it arrives, enabling preemptive biosecurity measures.
  • Reserve design and gap analysis — identifying areas of high predicted suitability for threatened species that fall outside current protected area networks.
  • Population modelling integration — linking ENM-derived habitat suitability surfaces to population ecology models to estimate spatially explicit population viability across fragmented landscapes.
  • Biodiversity hotspot refinement — identifying priority micro-areas within biodiversity hotspots that concentrate the largest number of endemic species with suitable habitat.

Limitations and Best Practices

ENM results are sensitive to the quality and spatial bias of occurrence data, the choice of environmental predictors, the extent and resolution of the study area, and the specific algorithms used. Key limitations include the assumption of equilibrium between species distributions and current climate (which may be violated for range-shifting species), the difficulty of incorporating biotic interactions (competition, facilitation), and the challenge of validating model predictions in regions where the species is absent due to dispersal limitation rather than environmental unsuitability.

Best practice guidelines (Araújo et al. 2019 in Science Advances) recommend: independent data partitioning for cross-validation, spatial blocking of training and test data to limit the effect of spatial autocorrelation on performance estimates, use of multiple evaluation metrics (AUC, TSS, Boyce index), ensemble modelling, and explicit reporting of uncertainty through prediction variance maps.
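
To make the spatial blocking recommendation concrete, here is a minimal sketch that assigns occurrence records to cross-validation folds by grid block instead of at random. The block size, the number of folds, and the round-robin assignment of blocks to folds are assumptions chosen for simplicity; dedicated tools offer more careful block designs.

```python
import math


def spatial_block_folds(points, block_deg=1.0, k=4):
    """Assign each (lat, lon) point to one of k folds by spatial block.

    All points in the same block_deg x block_deg grid block share a fold,
    so nearby records never end up split between training and test data.
    """
    blocks = [
        (math.floor(lat / block_deg), math.floor(lon / block_deg))
        for lat, lon in points
    ]
    # Deterministic round-robin assignment of occupied blocks to folds.
    fold_of_block = {block: i % k for i, block in enumerate(sorted(set(blocks)))}
    return [fold_of_block[block] for block in blocks]


# Two clusters about five degrees apart fall into different folds.
folds = spatial_block_folds(
    [(0.1, 0.1), (0.9, 0.9), (5.2, 5.2), (5.3, 5.9)], block_deg=1.0, k=2
)
```

The block size should be at least as large as the distance over which the occurrence data or the model residuals are autocorrelated; otherwise the test folds still overlap the training data in environmental terms.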

Frequently Asked Questions

What is the difference between ENM and SDM?

Ecological Niche Modelling (ENM) and Species Distribution Modelling (SDM) are largely interchangeable terms in practice. ENM emphasises the theoretical basis — characterising the Hutchinsonian niche from occurrence data and environmental predictors. SDM emphasises the output: a predicted geographic distribution. Most researchers now use both terms to describe the same analytical workflow — fitting a model on occurrence-environment associations, then projecting it onto a geographic surface to produce a habitat suitability map.

How many occurrence records are needed for a reliable ENM analysis?

There is no universal minimum, but most practitioners consider 20–30 spatially independent occurrence records to be a practical lower bound for presence-only methods such as MaxEnt. Performance improves substantially up to about 100–200 records and tends to plateau beyond that for most species–environment relationships. Critically, spatial independence matters more than raw count: 50 clustered records from a single field survey provide less information than 20 records spread across the species' known range.
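
One way to move a clustered dataset toward spatial independence is distance-based thinning. The sketch below is a simple greedy version, assuming records are given as (lat, lon) pairs in decimal degrees. The minimum distance is a hypothetical parameter to be chosen per species, and the result depends on input order, which published thinning tools typically handle by randomising over many runs.

```python
import math

EARTH_RADIUS_KM = 6371.0


def haversine_km(a, b):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (a[0], a[1], b[0], b[1]))
    h = (
        math.sin((lat2 - lat1) / 2) ** 2
        + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2
    )
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(h))


def thin_by_distance(points, min_km=10.0):
    """Greedily keep points that are at least min_km from every kept point."""
    kept = []
    for p in points:
        if all(haversine_km(p, q) >= min_km for q in kept):
            kept.append(p)
    return kept


# The second point is roughly 1 km from the first and is dropped.
thinned = thin_by_distance([(0.0, 0.0), (0.0, 0.01), (0.0, 1.0)], min_km=10.0)
```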

Which environmental predictors are most commonly used in terrestrial ENM?

The WorldClim 2.1 and CHELSA bioclimatic variable sets are the most widely used for terrestrial species, providing 19 variables derived from monthly temperature and precipitation data. Topographic variables — elevation, slope, topographic wetness index — from 30 m SRTM or Copernicus DEM data are frequently added for fine-scale analyses. Variable selection using variance inflation factor (VIF) or Pearson correlation thresholds to remove multicollinear predictors is standard practice before model fitting.
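
The correlation-threshold step can be sketched as a greedy filter: walk through the predictors in priority order and keep each one only if its absolute Pearson correlation with every predictor already kept is below the threshold. The 0.7 default, the function names, and the sample values are assumptions for illustration; constant layers (zero variance) would need to be removed first, since the correlation is undefined for them.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


def drop_correlated(layers, threshold=0.7):
    """Return names of predictors to keep.

    layers: dict mapping predictor name -> values sampled at the same
            locations, listed in order of ecological priority
    """
    kept = []
    for name, values in layers.items():
        if all(abs(pearson(values, layers[k])) < threshold for k in kept):
            kept.append(name)
    return kept


# Made-up sample values; bio5 tracks bio1 closely and is dropped.
selected = drop_correlated(
    {
        "bio1": [1, 2, 3, 4, 5],
        "bio5": [2, 4, 6, 8, 10.5],
        "bio12": [5, 1, 4, 2, 3],
    }
)
```

Ordering the input by ecological relevance means that when two predictors are collinear, the one with the clearer biological interpretation is the one retained.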

What is MaxEnt and why is it so widely used?

MaxEnt (Maximum Entropy) estimates the probability distribution of a species by finding the distribution of maximum entropy — the most uniform distribution — subject to constraints imposed by the species' known occurrence records and background environmental data. It requires only presence records (no absences), performs relatively well with small sample sizes, and produces interpretable response curves for each environmental predictor. These properties make it practical for most taxa and geographies, which is why it became the most cited ENM algorithm.
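
The optimisation described above can be written compactly. The notation is introduced here for illustration and does not appear elsewhere in this article: X is the set of background cells, f_j are features derived from the environmental predictors, and \bar{f}_j is the mean of feature j over the occurrence records.

```latex
\begin{aligned}
\max_{p}\quad & H(p) = -\sum_{x \in X} p(x)\,\ln p(x) \\
\text{subject to}\quad & \sum_{x \in X} p(x)\, f_j(x) = \bar{f}_j \quad (j = 1,\dots,n),
  \qquad \sum_{x \in X} p(x) = 1 \\
\text{solution}\quad & p_\lambda(x) = \frac{\exp\!\Big(\sum_{j} \lambda_j f_j(x)\Big)}{Z_\lambda},
  \qquad Z_\lambda = \sum_{x' \in X} \exp\!\Big(\sum_{j} \lambda_j f_j(x')\Big)
\end{aligned}
```

The maximum-entropy solution therefore has an exponential (Gibbs) form, with one weight \lambda_j per feature. Practical implementations relax the equality constraints with a regularisation term so that the fitted distribution does not match the sample means too closely, which matters most when occurrence records are few.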

What are the main steps in an ENM workflow?

A typical ENM workflow proceeds through five stages: (1) Occurrence data assembly — download records from GBIF or other repositories, clean for duplicates and spatial errors; (2) Environmental data preparation — select bioclimatic and topographic predictor layers, remove multicollinear variables using VIF thresholds; (3) Model fitting — calibrate the chosen algorithm (MaxEnt, Random Forest, GLM) on the cleaned dataset, applying spatial cross-validation; (4) Model evaluation — assess performance using AUC-ROC, Boyce index, or TSS against withheld test data; (5) Projection — transfer the fitted model onto geographic space to produce habitat suitability maps for current or future climate conditions.

How is ENM model performance evaluated?

The most widely used metric is the area under the receiver operating characteristic curve (AUC-ROC), which measures the model's ability to discriminate presence locations from background points — ranging from 0.5 (no discrimination) to 1.0 (perfect). The True Skill Statistic (TSS) adjusts for prevalence and is preferred for presence-absence data. For presence-only data, the continuous Boyce index is increasingly recommended because it avoids the arbitrariness of absence sampling. Spatial cross-validation — partitioning occurrence records by geographic block rather than randomly — is considered best practice to avoid spatial autocorrelation inflating apparent model performance.
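
Two of these metrics are simple enough to compute by hand. The sketch below calculates AUC through its rank interpretation (the probability that a randomly chosen presence scores higher than a randomly chosen background point, with ties counted as half) and TSS as sensitivity plus specificity minus one at a given threshold. The function names and sample scores are illustrative assumptions.

```python
def auc(presence_scores, background_scores):
    """AUC as the share of presence/background pairs ranked correctly."""
    wins = 0.0
    for p in presence_scores:
        for b in background_scores:
            if p > b:
                wins += 1.0
            elif p == b:
                wins += 0.5
    return wins / (len(presence_scores) * len(background_scores))


def tss(presence_scores, absence_scores, threshold):
    """True Skill Statistic at a suitability threshold."""
    # Sensitivity: share of presences predicted suitable.
    sensitivity = sum(s >= threshold for s in presence_scores) / len(presence_scores)
    # Specificity: share of absences predicted unsuitable.
    specificity = sum(s < threshold for s in absence_scores) / len(absence_scores)
    return sensitivity + specificity - 1.0


# Illustrative suitability scores at withheld test locations.
presences = [0.9, 0.8, 0.4]
background = [0.5, 0.3, 0.2, 0.1]
auc_value = auc(presences, background)        # 11 of 12 pairs ranked correctly
tss_value = tss(presences, background, 0.5)   # 2/3 + 3/4 - 1
```

The pairwise loop is quadratic in the number of points, which is fine for a demonstration; a rank-based implementation scales better for large background samples.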

Where can biodiversity occurrence records be obtained for ENM?

The Global Biodiversity Information Facility (GBIF) is the primary open-access repository, aggregating over 2.4 billion records from natural history collections, citizen science platforms, and research databases worldwide. Other important sources include iNaturalist (citizen science), OBIS (Ocean Biodiversity Information System) for marine species, VertNet for vertebrates, and regional herbarium networks. For historical or taxonomically difficult groups, targeted literature searches and museum voucher specimens may be necessary to supplement aggregated databases. Data cleaning (removing duplicate coordinates, historical records with low spatial precision, and records outside the species' known range) is always required before modelling.
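
As a starting point for retrieving records programmatically, the sketch below only builds a query URL for GBIF's public occurrence search API; it does not send the request. The endpoint and parameter names reflect the GBIF API as understood at the time of writing and should be checked against the current API documentation. The helper name and the example species are hypothetical, and large downloads are better made through GBIF's download service so that they receive a citable DOI.

```python
from urllib.parse import urlencode

GBIF_SEARCH = "https://api.gbif.org/v1/occurrence/search"


def gbif_search_url(scientific_name, limit=300, offset=0):
    """Build a search URL for georeferenced records of one species.

    limit and offset page through the results; the page size of 300 is
    an assumed default and may need adjusting to the API's current cap.
    """
    params = {
        "scientificName": scientific_name,
        "hasCoordinate": "true",        # only records with coordinates
        "hasGeospatialIssue": "false",  # skip records flagged by GBIF
        "limit": limit,
        "offset": offset,
    }
    return f"{GBIF_SEARCH}?{urlencode(params)}"


# Example species chosen only for illustration.
url = gbif_search_url("Lynx pardinus")
```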

How can ENM be used to predict the impacts of climate change on species?

ENM supports climate change impact assessment by fitting a model on current occurrence-environment relationships, then projecting (transferring) that model onto future climate scenarios from the Coupled Model Intercomparison Project (CMIP5 or CMIP6). The resulting maps show predicted range shifts: areas becoming climatically suitable or unsuitable under different warming trajectories. Analysts typically run projections across multiple climate models and emission scenarios (RCPs for CMIP5, SSPs for CMIP6) to capture uncertainty. Key outputs include estimated range area change, projected shifts in range centroids, and identification of climate refugia, which are areas likely to retain suitable conditions through the century. Ensemble modelling across multiple algorithms is standard practice to reduce algorithm-specific bias.
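
The range area change mentioned above can be sketched by thresholding the current and future suitability maps and comparing them cell by cell. This example assumes both maps cover the same cells in the same order, that all cells have equal area, and that a single threshold of 0.5 is appropriate; all three are simplifications, and the function name and sample values are hypothetical.

```python
def range_change(current, future, threshold=0.5):
    """Compare binarised current and future suitability maps.

    current, future: suitability values for the same cells, same order
    Returns counts of stable, lost, and gained cells plus the net
    percentage change relative to the current range.
    """
    cur = [s >= threshold for s in current]
    fut = [s >= threshold for s in future]
    stable = sum(c and f for c, f in zip(cur, fut))      # candidate refugia
    lost = sum(c and not f for c, f in zip(cur, fut))
    gained = sum(f and not c for c, f in zip(cur, fut))
    n_current = stable + lost
    percent = 100.0 * (gained - lost) / n_current if n_current else float("nan")
    return {"stable": stable, "lost": lost, "gained": gained, "percent_change": percent}


# Illustrative four-cell maps: one suitable cell is lost, none gained.
summary = range_change([0.9, 0.8, 0.2, 0.1], [0.9, 0.3, 0.2, 0.1])
```

Gained cells are only potential range: whether the species can actually reach them depends on dispersal, which this comparison does not model.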