Constraints of using historical data for modelling the spatial distribution of helminth parasites in ruminants

Special Issue – Combatting Anthelmintic resistance in ruminants. Invited Editors: Johannes Charlier, Hervé Hoste, and Smaragda Sotiraki

Open Access

ODMAP checklist

Hendrickx et al., 2020: Constraints of using historical data for modelling the spatial distribution of helminth parasites in ruminants.

ODMAP element	Contents
Overview
Authorship	Authors: Hendrickx A, Marsboom C, Rinaldi L, Rose Vineer H, Morgoglione EM, Sotiraki S, Cringoli G, Claerebout E and Hendrickx G.
	Contact email: ghendrickx@avia-gis.com
	Title: Constraints of using historical data for modelling the spatial distribution of helminth parasites in ruminants.
	DOI:
Model objective	Objective: Explanation < Mapping (Inference < Interpolation).
Model objective	Target outputs: Maps of relative probability of presence for Italy.
Taxon	Parasitic helminth, Dicrocoelium dendriticum.
Location	Italy.
Scale of analysis	Spatial extent (Lon/Lat): 6.7–18.5 °E, 36.6–47.12 °N.
	Spatial resolution: 1 km.
	Temporal extent/time period: 1999–2018.
	Type of extent/boundary: Administrative boundary (Italian border).
Biodiversity data overview	Observation type: Veterinary diagnostic data and field survey.
Biodiversity data overview	Response/data type: Presence–absence.
Type of predictors	Vegetation, bioclimatic, livestock density.
Conceptual model/hypothesis	Hypotheses about species-environment relationships: There is evidence that the distribution of D. dendriticum is driven by vegetation and climate (both directly and indirectly via intermediate host influences). However, the quality, temporal resolution and quantity of available occurrence data may constrain the application of species distribution models to predict the distribution of D. dendriticum cases. We developed SDMs to evaluate the impact of subsetting historic occurrence data on model performance.
Assumptions	We assumed that: Diagnostic data were representative of presence or absence of infection in the host. Sensitivity of diagnostic data does not change in space or time. The chosen environmental covariates represent all relevant environmental drivers of distribution. The data encompass the species’ realised niche in the area modelled (after bias-correction – see below). Sample selection bias is adequately corrected (see below).
SDM algorithms	Algorithms: Random Forest – this method was chosen because of experience with the model and good performance in previous modelling exercises.
	Model complexity: 500 trees with 8 variables evaluated at each node.
	Ensembles: Not applicable (except for bagging implicit in the Random Forest algorithm).
Model workflow	After preparation of environmental covariates, removal of errors and bias-correction (see below), Random Forest models were fitted to the full dataset, and to a reduced set of covariates identified as important in the full model. This process was repeated for incrementally increasing sample sizes (see partitioning information below) to identify the minimal sample size, below which statistical performance deteriorates. Models were also fitted using the same process to annual occurrence data.
Software	Software: Environmental variable processing was completed in R v3.4.3 (R Core Team, 2017 [43]). Mapping and Random Forest modelling were completed in VECMAP® (https://www.avia-gis.com/vecmap).
Data
Biodiversity data	Taxon names: Dicrocoelium dendriticum.
	Ecological level: Species.
	Data source: CReMoPAR, a parasitological reference lab from the Naples (Italy) area. Diagnostic data collected 1999–2018.
	Sampling design: Samples submitted from throughout Italy for diagnosis (faecal egg counts) and opportunistic samples collected in the region surrounding CReMoPAR.
	Sample size: 5131 occurrences.
	Regional mask: Data were clipped to the Italian boundary.
	Scaling: Not applicable.
	Background data: Not applicable.
	Errors and biases: Parts of Italy not represented in the dataset were masked using an environmental envelope generated using a Mahalanobis distance approach. The area within this environmental envelope was used for model development to avoid projecting model predictions outside of the range of the occurrence data. The data set was balanced by randomly subsampling the largest class.
Data partitioning	A model was developed using the full dataset to demonstrate the “best-case scenario” (BCS), before reducing the size of the occurrence dataset at random in 10% increments to evaluate the impact of sample size on model performance.
Data partitioning	Models were also fitted separately to the data for each of the 5 years between 1999 and 2018 with the largest occurrence sample sizes, to evaluate the impact of the annual dataset used on model performance.
Predictor variables	Predictor variables and data sources: NDVI data from MODIS (http://modis.gsfc.nasa.gov/), Fourier-transformed according to the methods described by Estrada-Peña et al. [13]. Bioclimatic variables [17] derived from ERA5 [1] temperature data. Gridded Livestock of the World livestock density data [33].
	Spatial resolution and extent of the raw data: The livestock density data were available at a 10 km resolution. Bioclimatic data were available at a 1 km resolution. NDVI data were available at a 1 km resolution.
	Geographic projection: WGS84.
	Temporal resolution and extent of the raw data: The livestock density data represent predicted livestock density for 2011. Bioclimatic variables and NDVI data were averaged for the temporal extent of the occurrence dataset (1999–2018).
	Data processing: All data were clipped to the Italian boundary before processing. Resampling/aggregation was not performed to standardise resolution.
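The class balancing and the 10% sample-size increments described under "Errors and biases" and "Data partitioning" can be sketched as follows. This is a minimal illustration in Python, not the VECMAP workflow used by the authors. The function names, the toy records and the fixed seed are hypothetical, and drawing each subset independently (rather than nesting smaller subsets inside larger ones) is an assumption, since the checklist does not specify it.

```python
import random


def balance_classes(records, rng):
    """Randomly subsample the larger class down to the size of the smaller one.

    `records` is a list of (site_id, presence) pairs with presence in {0, 1}.
    """
    presences = [r for r in records if r[1] == 1]
    absences = [r for r in records if r[1] == 0]
    n = min(len(presences), len(absences))
    return rng.sample(presences, n) + rng.sample(absences, n)


def incremental_subsets(records, rng, n_steps=10):
    """Return (fraction, subset) pairs for 10%, 20%, ..., 100% of the records.

    Assumption: each subset is an independent random draw without replacement.
    """
    result = []
    for i in range(1, n_steps + 1):
        size = len(records) * i // n_steps  # integer arithmetic avoids float rounding
        result.append((i / n_steps, rng.sample(records, size)))
    return result


# Hypothetical toy data: 30 presences and 70 absences.
rng = random.Random(42)
toy = [(i, 1) for i in range(30)] + [(i, 0) for i in range(30, 100)]
balanced = balance_classes(toy, rng)
subsets = incremental_subsets(balanced, rng)
```

In this sketch a model would then be fitted to each subset in turn and its performance statistics compared against those of the full, balanced dataset.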
Model
Variable pre-selection	The choice of initial covariates was made as a compromise between availability and ecological/biological relevance to the study species. Only weakly correlated covariates were included in the models.
Multicollinearity	Multicollinearity between the covariates was investigated using the variance inflation factor and Spearman rank correlations. Covariates with VIF > 10 were discarded. Only one variable from pairs with correlations >0.7 was retained to avoid model overfitting.
Model settings	Default settings were used throughout, except for the number of replicates and the number of variables to evaluate at each node. For initial models using all variables (variable selection step), 500 replicates and 8 variables were specified. For models using the reduced set of variables selected for their importance, 100 replicates and 6 variables were specified.
Model estimates	Covariate importance was estimated with mean decrease in accuracy and mean decrease in Gini.
Model averaging/ensembles	Not applicable.
Non-independence	Not done, see discussion.
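The Spearman-based screening described under "Multicollinearity" can be illustrated with a short Python sketch. It keeps one covariate from each pair whose absolute rank correlation exceeds 0.7. The greedy, order-dependent choice of which member of a pair to keep is an assumption (the checklist does not state the rule), the covariate names and values are hypothetical, and the VIF step is not shown.

```python
def ranks(values):
    """Return 1-based ranks, assigning tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


def drop_correlated(covariates, threshold=0.7):
    """Keep a covariate only if |rho| <= threshold against every covariate kept so far."""
    kept = []
    for name in covariates:
        if all(abs(spearman(covariates[name], covariates[k])) <= threshold for k in kept):
            kept.append(name)
    return kept


# Hypothetical covariate values at six locations.
toy = {
    "ndvi_mean": [0.2, 0.4, 0.5, 0.7, 0.9, 0.3],
    "ndvi_amplitude": [0.25, 0.45, 0.5, 0.75, 0.95, 0.35],  # same ranking as ndvi_mean
    "sheep_density": [12, 3, 40, 8, 25, 1],
}
kept = drop_correlated(toy)
```

Here `ndvi_amplitude` is dropped because it ranks the locations identically to `ndvi_mean`, while `sheep_density` is retained.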
Assessment
Performance statistics	Models were evaluated using Sensitivity, Specificity, Cohen’s Kappa, and Area Under the Curve (AUC).
Plausibility checks	Expert analysis is used to evaluate the plausibility of the mapped model outputs.
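For reference, the four performance statistics listed above can be computed from observed presence–absence values and predicted scores as in the following Python sketch. The toy observations, the scores and the 0.5 classification threshold are hypothetical and are not taken from the study.

```python
def confusion_counts(observed, predicted):
    """Return (tp, tn, fp, fn) for binary observed and predicted labels."""
    pairs = list(zip(observed, predicted))
    tp = sum(1 for o, p in pairs if o == 1 and p == 1)
    tn = sum(1 for o, p in pairs if o == 0 and p == 0)
    fp = sum(1 for o, p in pairs if o == 0 and p == 1)
    fn = sum(1 for o, p in pairs if o == 1 and p == 0)
    return tp, tn, fp, fn


def sensitivity(tp, fn):
    """Proportion of observed presences predicted as present."""
    return tp / (tp + fn)


def specificity(tn, fp):
    """Proportion of observed absences predicted as absent."""
    return tn / (tn + fp)


def cohens_kappa(tp, tn, fp, fn):
    """Agreement between observed and predicted labels, corrected for chance."""
    n = tp + tn + fp + fn
    p_observed = (tp + tn) / n
    p_expected = ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / n ** 2
    return (p_observed - p_expected) / (1 - p_expected)


def auc(observed, scores):
    """Probability that a random presence outscores a random absence (ties count 0.5)."""
    pos = [s for o, s in zip(observed, scores) if o == 1]
    neg = [s for o, s in zip(observed, scores) if o == 0]
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0 for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical observations and predicted relative probabilities of presence.
observed = [1, 1, 1, 1, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.6, 0.3, 0.7, 0.4, 0.2, 0.1]
predicted = [1 if s >= 0.5 else 0 for s in scores]  # assumed 0.5 threshold

tp, tn, fp, fn = confusion_counts(observed, predicted)
print(sensitivity(tp, fn), specificity(tn, fp), cohens_kappa(tp, tn, fp, fn), auc(observed, scores))
```

Sensitivity, Specificity and Kappa depend on the chosen threshold, whereas AUC is computed directly from the continuous scores.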
Prediction
Prediction output	Predictions of the relative probability of presence of D. dendriticum are expressed on a continuous scale. Maps are restricted to the environmental envelope identified using a Mahalanobis distance approach (see above) to avoid projecting outside of the range of the occurrence data.
Uncertainty quantification	Not applicable – ensembles not performed.
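The idea behind the Mahalanobis-distance envelope used to restrict the maps can be sketched in Python for the simplest case of two covariates. This is an illustration only, under stated assumptions: the checklist does not give the cut-off used, so the envelope here is taken to be the largest squared distance observed among the sampled locations, and the covariate values are hypothetical.

```python
def mean_and_covariance(points):
    """Sample mean and covariance (sxx, sxy, syy) of 2-D covariate vectors."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    sxx = sum((p[0] - mx) ** 2 for p in points) / (n - 1)
    syy = sum((p[1] - my) ** 2 for p in points) / (n - 1)
    sxy = sum((p[0] - mx) * (p[1] - my) for p in points) / (n - 1)
    return (mx, my), (sxx, sxy, syy)


def mahalanobis_sq(point, mean, cov):
    """Squared Mahalanobis distance, using the closed-form inverse of a 2x2 covariance."""
    sxx, sxy, syy = cov
    det = sxx * syy - sxy ** 2
    dx, dy = point[0] - mean[0], point[1] - mean[1]
    return (syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det


def inside_envelope(cell, mean, cov, max_d2):
    """True if a grid cell's covariates lie within the sampled environmental range."""
    return mahalanobis_sq(cell, mean, cov) <= max_d2


# Hypothetical covariate values (e.g. two standardised predictors) at sampled locations.
sampled = [(0.0, 0.0), (2.0, 0.0), (0.0, 2.0), (2.0, 2.0), (1.0, 1.0)]
mean, cov = mean_and_covariance(sampled)

# Assumed cut-off: the largest squared distance among the sampled locations.
max_d2 = max(mahalanobis_sq(p, mean, cov) for p in sampled)

similar_cell = inside_envelope((1.5, 1.5), mean, cov, max_d2)
novel_cell = inside_envelope((4.0, 4.0), mean, cov, max_d2)
```

Cells such as `novel_cell`, whose covariates fall outside the envelope, would be masked rather than assigned a prediction.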
