Constraints of using historical data for modelling the spatial distribution of helminth parasites in ruminants

Dicrocoelium dendriticum is a trematode that infects ruminant livestock and requires two different intermediate hosts to complete its lifecycle. Modelling the spatial distribution of this parasite can help to improve its management in higher risk regions. The aim of this research was to assess the constraints of using historical data sets when modelling the spatial distribution of helminth parasites in ruminants. A parasitological data set provided by CREMOPAR (Napoli, Italy) and covering most of Italy was used in this paper. A baseline model (Random Forest, VECMAP®) using the entire data set was first used to determine the minimal number of data points needed to build a stable model. Then, annual distribution models were computed and compared with the baseline model. The best prediction rate and statistical output were obtained for 2012 and the worst for 2016, even though the sample size of the former was significantly smaller than the latter. We discuss how this may be explained by the fact that in 2012, the samples were more evenly geographically distributed, whilst in 2016 most of the data were strongly clustered. It is concluded that the spatial distribution of the input data appears to be more important than the actual sample size when computing species distribution models. This is often a major issue when using historical data to develop spatial models. Such data sets often include sampling biases and large geographical gaps. If this bias is not corrected, the spatial distribution model outputs may display the sampling effort rather than the real species distribution.


Introduction
The lancet liver fluke Dicrocoelium dendriticum is a parasite of the bile ducts and gallbladder of different mammalian species (mainly ruminants), including humans [31,34].
The life cycle of this parasite requires two invertebrate intermediate hosts: one being a xerophilic terrestrial snail (of various genera such as Helicella, Zebrina or Cernuella), and the other an ant (mainly of the genus Formica) [25].
Clinical signs in ruminants are not usually manifest, even in severe infections, and therefore, major lesions, due to liver impairment are detectable only at post-mortem examination [31,34]. Lesions are directly proportional to the parasitic burden [23] and chronic inflammation of the bile ducts [8]. In the early stages of the infection, reduced weight gain can be detected, but the infection is usually asymptomatic [34] resulting only in livers being discarded during meat inspection at slaughterhouses or with an appropriate coprodiagnostic analysis [35]. In severe cases, infection can lead to emaciation, anaemia with economic losses in production, and viscera condemnation in animals [2,14,35]. Therefore, D. dendriticum is, together with Fasciola hepatica, one of the leading causes of discarded livers in the abattoir, with associated economic losses. Dicrocoeliosis is also zoonotic [22,36]. Hence, modelling the spatial distribution of this parasite can help to improve its management in higher risk regions [11,27]. Due to global climate change, seasonal and spatial patterns of parasites can alter [7]. This also includes indirect effects of climate change such as management changes [32]. Extended grazing periods, animal movements and anti-helminthic resistance leading to treatment failure are important drivers that boost the presence of parasites [5,15,19,28].
Development of D. dendriticum is distinctly dependent on ecology, geo-climatic factors and anthropogenic factors. This is mainly due to its intermediate hosts that require highly specific environmental niches such as calcareous or alkaline soils [29]. This results in a widespread presence of this trematode throughout Europe with locally heterogeneous spatial distribution patterns and a significant variation in local prevalence [25].
The environment affects the phases of the parasite lifecycle. Therefore, it is important to include these factors while making risk maps. The environmental factors are geolocated [18]. Species distribution modelling (SDM), also known as environmental modelling, is a tool that combines different observations of species presence or absence with environmental predictors such as temperature, rainfall, elevation, soil type, and vegetation that are ecologically-relevant to the species being modelled [12].
The general approach to designing species distribution maps is shown in Figure 1. First, a number of grid-cells (A), in this case farms, are randomly selected within a larger area and are sampled to obtain occurrence data (presence = red, absence = green) (B). Second, a set of environmental data provides information for all the pixels in the sample area (C).
These are termed co-variates or predictor variables. Finally, different modelling methods can be used to predict the probability of occurrence of trematodes within each of the grid-cells, which generates a risk map covering the entire area (D).
To develop high-accuracy risk maps, it is pivotal to use the right combination of predictors [13]. Clustering can be observed for D. dendriticum in the southern part of Italy mainly due to the specific environmental needs of the intermediate hosts [6,29]. Ekstam et al. [11] and Musella et al. [29] showed that the prevalence of D. dendriticum increases in areas with woody vegetation and decreases in wet areas. Species distribution model algorithms and the accuracy of model output are also sensitive to the sample size of species occurrence records [41], and spatial sample selection bias [3].
The aim of this study was to evaluate the impact of sample size on model predictive performance for D. dendriticum in Italy, using a historical longitudinal data set of diagnostic data, and to evaluate the utility of opportunistic diagnostic data at a higher temporal resolution for predicting the distribution of this species.

Materials and methods
Overview A Random Forest (RF) species distribution modelling algorithm was applied to D. dendriticum diagnosis data and environmental covariates to predict the spatial probability distribution of this species in Italy. The model was replicated using random subsets of the occurrence dataset to determine the sample size threshold below which model performance deteriorates. Models developed using annual subsets of occurrence data were compared against this threshold to evaluate the impact of using historic datasets with restricted sample size and temporal resolution on model performance.
Model development is documented according to the ODMAP guidelines for reporting Species Distribution Models ( [42]; Appendix).

Data
The preparation of both the covariate and the disease data was conducted in R, version 3.4.3 [39]. VECMAP Ò was used to generate a basic presence-absence map for rapid visualisation of the georeferenced presence and absence of the parasites over the years.

D. dendriticum occurrence data
The study area for this research is Italy, which is divided in 20 regions (Fig. 2) and has a surface area of 301.340 km 2 . Its climate varies depending on the region. Overall low precipitation is seen and the mean temperatures vary between 5 and 20°C depending on the region (http://koeppen-geiger. vu-wien.ac.at/). D. dendriticum data covering most of Italy ranging from 1999 to 2018 were provided by the Regional Centre for Monitoring of Parasitosis (CREMOPAR), Campania Region, Southern Italy. A wide variety of sources contributed to the historical data set, including scattered samples collected by veterinary practitioners, passive surveillance, and clustered data  collected as part of active surveillance programmes in limited areas. None of these data sources were free from bias (e.g. samples from veterinarians were biased towards symptomatic cases) and most of the surveys were conducted in southern Italy due to the continuous monitoring service offered by the Department of Agriculture of the Campania Region, through the activities of CREMOPAR.

Environment covariates
In total, the predictor set used to develop the D. dendriticum distribution models comprised 39 variables (Table 1). Collinearity was checked using the variance inflation factor and Spearman rank correlation. Variables with VIF > 10 were removed from further analyses. Only one of variable pairs with correlation >0.7 was retained.
Monthly normalised difference vegetation index (NDVI) data with a resolution of 1 km, a vegetation index which measures the photosynthetic activity of plants, were obtained from MODerate-resolution Imaging Spectroradiometer (MODIS) imagery (http://modis.gsfc.nasa.gov/). These data were Fourier transformed using the methods of [13], to derive biologicallyrelevant secondary variables. We used this variable to help identify the habitat of D. dendriticum, as mentioned above.
BioClim variables [17] at a resolution of 30 s, were added to the set of environmental predictors. These are climate indicators that may affect species distribution, summarising the period 1970-2000, and are developed by the U.S. Geological Survey (USGS). They represent information regarding annual and seasonal conditions and differences through the different seasons in one year. This can be as a derived variable or in a time-series [30]. These data were used for the baseline model and sample size evaluation.

Abbreviation
Variable NDVI_14A0 Normalised difference vegetation index transformed Fourier analysis band 14 -A0mean NDVI_14A1 Normalised difference vegetation index transformed Fourier analysis band 14 -A1amplitude of annual cycle NDVI_14A2 Normalised difference vegetation index transformed Fourier analysis band 14 -A2amplitude of bi-annual cycle NDVI_14A3 Normalised difference vegetation index transformed Fourier analysis band 14 -A3amplitude of tri-annual cycle NDVI_14D1 Normalised difference vegetation index transformed Fourier analysis band 14 -D1variance in annual cycle NDVI_14D2 Normalised difference vegetation index transformed Fourier analysis band 14 -D2variance in bi-annual cycle NDVI_14D3 Normalised difference vegetation index transformed Fourier analysis band 14 -D3variance in tri-annual cycle NDVI_14DA Normalised difference vegetation index transformed Fourier analysis band 14 -DAcombined variance in annual, bi-annual, and tri-annual cycles NDVI_14MN Normalised difference vegetation index transformed Fourier analysis band 14 -MNminimum NDVI_14MX Normalised difference vegetation index transformed Fourier analysis band 14 -MXmaximum NDVI_14P1 Normalised difference vegetation index transformed Fourier analysis band 14 -P1phase of annual cycle NDVI_14P2 Normalised difference vegetation index transformed Fourier analysis band 14 -P2phase of bi-annual cycle NDVI_14P3 Normalised difference vegetation index transformed Fourier analysis band 14 -P3phase of tri-annual cycle NDVI_14VR Normalised difference vegetation index transformed Fourier analysis band 14 -VRvariance in raw data parameter Fourier variable image values BIO 1 Annual mean temperature (°C) BIO 2 Annual mean diurnal range (°C) BIO 3 Isothermality (°C) BIO 4 Temperature seasonality (standard deviation) (°C) BIO 5 Tmax of warmest month (°C) BIO 6 Tmin of coldest month (°C) BIO 7 Annual temperature range (°C) BIO 8 Mean temperature of wettest quarter (°C) BIO 9 Mean temperature of driest quarter (°C) BIO 10 Mean temperature of warmest quarter (°C) BIO 11 Mean temperature of coldest quarter (°C) BIO 12 Annual precipitation (mm) BIO 13 Precipitation of wettest month (mm) BIO 14 Precipitation of driest month (mm) BIO 15 Precipitation seasonality (coefficient of variation) (%) BIO 16 Precipitation of wettest quarter (mm) BIO 17 Precipitation of driest quarter (mm) BIO 18 Precipitation of warmest quarter (mm) BIO 19 Precipitation of coldest quarter (mm) TempXX_A0 Temperature of XX (depending on the year that is modelled 09, 12, 13, 14, 15, 16)amplitude of annual cycle TempXX_A1 Temperature of XX (depending on the year that is modelled 09, 12, 13, 14, 15, 16)amplitude of bi-annual cycle TempXX_A2 Temperature of XX (depending on the year that is modelled 09, 12, 13, 14, 15, 16)amplitude of tri-annual cycle TempXX_P0 Temperature of XX (depending on the year that is modelled 09, 12, 13, 14, 15, 16)phase of annual cycle TempXX_P1 Temperature of XX (depending on the year that is modelled 09, 12, 13, 14, 15, 16)phase of bi-annual cycle TempXX_P2 Temperature of XX (depending on the year that is modelled 09, 12, 13, 14, 15, 16)phase of tri-annual cycle To fit models to annual data, ERA5 temperature data for the corresponding year were used to derive bioclimatic summaries to replace these variables of the BioClim (Bio01-Bio11) data set in the individual year models. ERA5 was developed by the European Centre of Medium-Range Weather Forecast (ECMWF) in 2017, including hourly estimates of different variables. It contains information such as temperature, humidity, pressure and wind in the specific year of interest [1].
Host distribution data were obtained from the Gridded Livestock of the World (GLW 3) database, a collaboration of the Food and Agriculture Organisation (FAO) and Environmental Research Group Oxford (ERGO). This database provides the distribution layers of bovines, small ruminants, pigs, and poultry derived by multivariate regression [33].

Spatial bias-correction
The occurrence data set did not include data points throughout Italy and is therefore not representative for the whole of Italy. As a result, we masked out the parts of Italy for which the data are not representative using an environmental envelope, also called climatic envelope, which is based on a set of environments in which it is supposed that the species persist, because the environmental needs of the species are satisfied [40]. For this, Mahalanobis distance (MD) was used. Farber and Kadmon [16] showed that using MD resulted in more accurate predictions of species distributions compared to standard envelopes that are rectilinear. Hereafter, only the area within this environmental envelope was used for model development and mapping, to avoid projecting outside of the range of the model input data.

Data extraction
Prior to model development, data were first prepared in VECMAP Ò by extracting environmental covariate data from the sites where D. dendriticum is present and absent using the "Extract data" function. The extract data tool iterates through all the defined environment covariate images in the predictor suite and extracts an environmental value for each data occurrence data point used to develop (train) the model, and data outliers are excluded based on the standard deviation of the predictor values. Second, the extracted data set needs to be balanced when using non-bootstrapping models such as Random Forest. We randomly sampled the largest class to result in an equalisation of the presence and absence points.

Model development Random Forest baseline model -best case scenario (BCS)
A machine learning modelling technique, Random Forest (RF), was applied using VECMAP Ò software, which is based on the randomForest R package [24]. This modelling technique was previously used for modelling bulk-milk tank antibodies against liver fluke at a European scale [10]. First, we computed a baseline Random Forest model using the entire occurrence and environmental covariate data set. The parasitic infection (presence/absence of the parasite in the final host) was used as a proxy for the parasite. The random Forest algorithm was applied to the extracted data (Sect. Data extraction) to group the presence-absence data into clusters based on different ecoclimatic patterns using a recursive partitioning approach (similar to a decision tree). This allows recognition of any pattern in the data [10]. Initially, a model was fitted to the complete dataset (all occurrence and covariate data), specifying 500 replicate trees and 8 environmental variables to be selected at random at each node. Variable importance was then assessed using mean decrease accuracy and mean decrease Gini and a reduced set of 3 environmental variables selected based on their importance (cf . Tables 2 and 3). A second model was then fitted to the reduced set of environmental variables, specifying 100 replicate trees and 6 variables to be selected at random at each node. Model evaluation is based on standard model statistics. These include sensitivity, specificity, Cohen's kappa, and area under curve (AUC). Expert analysis is also used to evaluate the plausibility of the mapped model outputs.

Minimal occurrence data sample size
The minimal number of occurrence data points needed to build a stable model was determined by first starting with the best-case scenario (BCS; described above). This is the maximum number of data points that can be used based on the number of presence and absence points available to assemble a baseline model. Thereafter, RF model replicates were developed as described above, with the occurrence data incrementally reduced by subtracting 10% of the total number of samples at random with one replicate. Model performance statistics were then compared between replicates; when a plateau face was reached in the statistical output, this was the minimal sample size needed to build a stable model. Model performance based on annual historic data We then investigated the possibility of developing annual distribution maps using only the data available for each single year for which data were available. Again, these models were computed using the random Forest machine learning algorithm of the VECMAP Ò software package, and the statistical performance compared against the BCS model and the model using the minimal sample size.

Data exploration
A total of 5131 D. dendriticum occurrence records were available. The distribution of D. dendriticum occurrence records was not spatially homogeneous (Fig. 3). Most of the data were

Baseline model -best case scenario (BCS)
Following initial RF model fitting, a reduced variable set (marked with an asterisk in Table 1) was used for development of subsequent models. Using this reduced variable set, the BCS model predicted an elevated probability of occurrence of D. dendriticum throughout Campania, Calabria, Lazio, Abruzzo, Marche, and Emilia-Romagna regions (Figs. 2 and 4, Table 2).  This reflects the known distribution of D. dendriticum on a broad spatial scale.
Minimal occurrence data sample size The RF model outputs are given in Table 2. A clear drop of all statistics is observed at À70% of the data points. This is referred to hereafter as the "cut-off" and corresponds to 758 presence-and 758 absence samples. The "cut-off" model using this reduced dataset of 30% of the occurrence data (Fig. 5) yielded similar spatial predictions to the BCS model, with the exception of a slightly elevated risk in Marche and Emilia-Romagna (Fig. 4, cf. Fig. 5).
With the exception of Bio11 (Mean Temperature of Coldest Quarter), which was identified as one of the 3 most important variables in all models except the model using only 10% of the occurrence data, variable importance varied according to the occurrence data subset used (Table 2).

Model performance based on annual historic data
No individual year reached the cut-off value of 758 presence samples. Therefore, we decided to model the 5 years that contained the highest number of data: 2009 (n = 178), 2012 (n = 165), 2013 (n = 135), 2015 (n = 120), and 2016 (n = 415). Model performance and variable importance varied according to the occurrence data subset used (Table 3). Fully mapped details were provided only for 2012 and 2016, the most interesting input for a discussion, in order to avoid overloading this paper.
The statistical output of 2012 with 165 presence data samples, showed that overall higher values than the cut-off are observed. The statistical output is almost equal to the baseline model ( Table 3). The data exploration map of 2012 (Fig. 6) showed that most of the data are located in the Campania and Basilicata regions. The model output of 2012 (Fig. 7) broadly reflects the BCS model (Fig. 4), predicting 0.5-1 probability of presence zones from the Po valley down to Calabria. Puglia, Sicily, and Sardinia are low-prediction regions. The statistical output of 2016 (Table 3), 415 samples, showed lower values compared to the baseline model. Except for AUC, the statistical values are approximately equal to the cut-off. The data exploration map of 2016 (Fig. 8) showed that most of the data are located in the Campania and Basilicata regions. A clustered zone of samples can also be observed in these regions. Despite the good statistical performance, the central-northern regions which have previously reported D. dendriticum infections (Fig. 3), and were previously identified as having elevated probability of presence using the BCS model (Fig. 4), are predicted to have a low probability of presence using only 2012 data. Compared to the other individual year models an overall low prediction zone is predicted throughout whole Italy in 2016 (Fig. 9). In Calabria and parts of Campania and Basilicata 0.75 predicted probability of presence zones are observed.
The results obtained for the other three years (maps not shown here) are summarised below.
The statistical output of 2009 (Table 3), with 178 samples, showed that compared to the statistical output of the cut-off (À70%), the values are approximately equal but still lower than the baseline model. The data exploration map of 2009 shows that most of the presence/absence points are located in the Campania and Basilicata regions. In the model output of 2009, we can observe 0.25-0.5 presence zones throughout Italy. In Marche, 0.75 predictive zones are observed and in Campania and Basilicata, high prediction (0.75-1) zones are observed.
The statistical output of 2013 (Table 3), 135 samples, showed that overall, except for Kappa, the statistical values are in the same range as the baseline model. The data exploration map of 2013 shows that most of the data are located in the Campania and Basilicata regions. Compared to 2012, the same pattern is seen in presence prediction but an overall higher prediction rate is observed. Marche is predicted as a high (1) presence region. Parts of Apulia, Sicily, and Sardinia had lower (0.25-0.5) prediction zones.
The statistical output of 2015 (Table 3), 120 samples, showed that the statistical values are almost equal to the baseline model. The data exploration map of 2015 shows that most of the data are located in the Campania and Basilicata regions. Again, the same pattern is observed in the model output starting from the Po valley to Calabria, not including the main part of Puglia and Sicily. Sardinia has a higher prediction zone (0.5-0.75) compared to the model from 2013. A higher prediction zone is seen around Marche, compared to 2013.

Discussion
The first step in the spatial modelling process is planning and gathering a presence-absence data set. In this research, a historical data set for Italy covering the period from 1999 to 2018 was provided by CREMOPAR. The data were obtained using both active and passive surveillance. The advantages of having access to such a large data set are obvious, but one of the main disadvantages is that the data were not specifically collected for developing spatial models. It seems clear that this results in a sampling bias: (i) some areas are oversampled (especially in central and southern regions); and (ii) there are large geographical gaps. If this bias is not corrected the spatial distribution model (SDM) outputs may display the sampling effort rather than the real species distribution [38]. In this study, this issue was partially solved by computing a mask based on the environmental envelope of D. dendriticum that excluded the pixels that were not representative for the sampled pixels which will be discussed later in this section.
The second step is data exploration. An important observation when exploring the occurrence data of dicrocoeliosis was a clustering of samples in some areas of southern Italy. The reason for this is probably mainly the active monitoring programme carried out by CREMOPAR. As a result, the area surrounding the CREMOPAR is overrepresented as compared to other sampled areas. This should not be confused with a clustered distribution pattern of this parasite, as reported in other studies [4,6,11,29] where clustered areas of presence were observed within larger sampled, but negative, areas. In these surveys, the clustering observed was likely due to the specific eco-climatic condition required by intermediate hosts of D. dendriticum to develop.
When making exploratory models, it is important to first determine the minimal sample size in order to model a reliable predictive map. Mateo et al. [26] showed that generated models are influenced by sample size and prevalence. The predictive power of a model increases when information is added. This applies until the statistical values reach a "plateau". From then, model performance is not considerably enhanced when additional data are added. When sample size decreases, the accuracy, reliability, and stability of the model should decrease as well. Therefore, it can be concluded that in order to generate a robust model a minimum sample size, and more specifically a minimum number of presence points, is needed. Also, SDMs are used to limit the sampling effort. Hence, setting a minimum sample size allows production of precise SDMs, without wasting expensive resources [37]. No individual annual subset of the occurrence data reached the cut-off determined from the baseline models' statistics. Therefore, the top 5 of most samples throughout the individual years were selected to do further modelling as a proof of concept.

Baseline model
For this research, we aimed to model the presence/absence of the parasite detected in the final host, sheep, using eco-climatic environmental predictor data. These are expected to show the strongest relationship with the distribution of the intermediate host and free-living stages of the parasite. We hence used predictor data that indirectly influence infection presence. In this case, an absence of data means that the infection did not occur when the diagnosis was made. Therefore, more presence data are needed to model a reliable predictive map. Hendrickx [20] showed that separately modelling a vector, a disease and its main symptom (e.g. anaemia) using eco-climatic predictor data, systematically resulted in a lower level of accuracy of the disease model for the same number of samples. This was explained by the fact that other factors than climatic data affect the distribution of a disease in a host, and that many different diseases may affect an observed symptom such as anaemia.
Indirect measures of parasite presence were used in this study as a proxy for the presence of the parasite. Copromicroscopic analysis for the presence of parasite eggs is the most widely used diagnostic procedure for D. dendriticum and other helminths [9], where accurate detection of the parasite in the environment is difficult (e.g. liver fluke metacercariae on pasture), and more direct measures of parasite presence in the host are invasive (e.g. post mortem to confirm parasite presence). Copromicroscopic techniques are advantageous as they allow for larger sample sizes than would be possible with more invasive or laborious sampling methods. However, these approaches may produce false-negative results where patent infections are not always detectable. In our case, we used eco-climatic data to model infection data, whilst other factors such as farm management and the presence of the intermediate host will also affect it. It is also very likely that disease management strategies may vary greatly over such a wide geographical range. The fact that these could not be included as co-variates may affect the quality of the model outputs. Nevertheless, the developed models performed well both statistically and qualitatively, showing that whilst not including such co-variate data may affect the reliability of identifying causal factors, this does not necessarily affect the efficiency of pattern matching that this type of modelling implements. The models therefore provide a good basis for further exploration of sample size requirements and the impact of sample subset on model performance.
The produced model output using the entire observed presence data set yielded an overall satisfactory result. In both the southern and central parts of Italy, the model provides satisfactory spatial detail as a valid tool for further field work towards refining the knowledge about the distribution patterns of this parasite. The lack of information in the most northern part of Italy is partially solved by removing non-representative areas from the modelling process.
Interestingly, when reducing the sample size (À70% model), whilst the general distribution pattern remains the same, this results in a loss of spatial detail. Here, the southern third of Italy remains very similar to the full model output, in the central part the high-risk (high probability of presence) areas increase in size, and in the northern third, there is also a strong shift towards a higher category. This suggests that the negative effect of an uneven spatial distribution mainly affects smaller sample sizes, as less variation within a sample may reflect less variation in the output.

Annual distribution models
When developing models for individual years, the Bioclim data for temperature (from Bio01 to Bio11) were replaced with data derived from ERA5 for this variable. ERA5, developed in 2017, provides more precise data because the data are registered hourly [1] compared to Bioclim, developed in 2005, that provides monthly climatic data based on long-term averages   [21]. This allows us to take into consideration specific climatic conditions prevailing in each year.
The individual year models confirm the statement that sample size affects model performance [26]. Though no individual year reached the sample size cut-off and results for annual models were variable, the quality of the outputs obtained for the years with the highest number of observations is encouraging. The output statistics differ for each year, but there is no clear decrease or increase in the statistics. The best prediction rate and statistical output were obtained for 2012 (Fig. 7) and the worst for 2016 (Fig. 9), even though the former sample size was significantly smaller than the latter. This can be explained by looking at the distribution of D. dendriticum in each individual year. The statistics of 2012 (Table 3) are improved compared to the cut-off (À70%) model. The mapped model output (Fig. 7) shows a similar pattern for southern and central Italy. For this model, the input data ( Fig. 6) are more evenly geographically distributed throughout Italy, and there is no dense data cluster in the southern third of the country. The statistics for 2016 (Table 3) are worse compared to the cutoff. For this year, a major portion of the data were strongly clustered in the southern part of Italy (Fig. 8). The mapped model output also shows a very different spatial distribution pattern in the central and northern parts of Italy; this is unlikely to occur because of climatic differences between years. As a result, the 2016 model failed to identify the central and northern region of Italy as suitable for D. dendriticum. This may also be affected by our choice to use the same full-data suitability mask when computing the annual models. In future work, we will further test the effect of using different suitability masks for each year.
Syfert et al. [38] showed that the prediction accuracy of models made with spatially clustered data is inferior to that with models which are not clustered. To overcome this, it may be possible to filter the database to reduce spatial autocorrelation, resulting in a data set with, for instance, maximum one record per km 2 cell. The impact of selecting maximum or mean values per pixel will be explored in further work. However, in this case, this would not solve the issues that (a) no year made the cut-off of minimal sample size to model a sufficiently stable predictive model, and (b) geographical gaps in the data set are too large.
Appropriate knowledge and robust experience on a parasite and its intermediate hosts are required to interpret data sets for their suitability when modelling. The open availability of many species occurrence datasets (e.g. https://www.gbif.org) and the apparent ease of implementing basic species distribution models, make species distribution modelling attractive, without considering the quality of the data used to develop such models. Our results simultaneously highlight the potential opportunities for modelling parasite distributions using longitudinal datasets of indirect measures of presence (diagnostic data), and the limitations of highly clustered data with a limited temporal range. Given the weaknesses of our data set, discussed above, the obtained results suggest that the proposed approach may contribute to highlight differences between years provided that the input data set is more evenly geographically distributed, and that additional predictor variables reflecting non-environmental factors, such as farm management, affecting the presence of the infection, are identified and available at sufficient spatial detail.
In conclusion, the spatial distribution of the input data appears to be more important than the actual sample size when computing species distribution models. This is often a major issue when using historical data for developing spatial models. Such data sets often include sampling biases and large geographical gaps. If this bias is not corrected, the SDM outputs may display the sampling effort rather than the real species distribution.

ODMAP element Contents Assumptions
We assumed that: -Diagnostic data were representative of presence or absence of infection in the host. -Sensitivity of diagnostic data does not change in space or time.
-The chosen environmental covariates represent all relevant environmental drivers of distribution.
-The data encompass the species realised niche in the area modelled (after bias-correction -see below).
-Sample selection bias is adequately corrected (see below).

SDM algorithms
Algorithms: Random Forestthis method was chosen because of experience with the model and good performance in previous modelling exercises. Model complexity: 500 trees with 8 variables. Ensembles: Not applicable (except for bagging implicit in the Random Forest algorithm).

Model workflow
After preparation of environmental covariates, removal of errors and bias-correction (see below), Random Forest models were fitted to the full dataset, and to a reduced set of covariates identified as important in the full model. This process was repeated for incrementally increasing sample sizes (see portioning information below) to identify the minimal sample size, below which statistical performance deteriorates. Models were also fitted using the same process to annual occurrence data. Software Software: Environmental variable processing was completed in R v3.4.3 (R Core Team, 2017 [43]). Mapping and Random Forest modelling were completed in VECMAP Ò (https://www.avia-gis.com/vecmap).

Data Biodiversity data
Taxon names: Dicrocoelium dendriticum. Ecological level: Species. Data source: CReMoPAR, a parasitological reference lab from the Naples (Italy) area. Diagnostic data collected 1999-2018. Sampling design: Samples submitted from throughout Italy for diagnosis (faecal egg counts), opportunistic samples collected in the region surrounding CReMoPAR. Sample size: 5131 occurrences. Regional mask: Data were clipped to the Italian boundary. Scaling: Not applicable. Background data: Not applicable. Errors and biases: Parts of Italy not represented in the dataset were masked using an environmental envelope generated using a Mahalanobis distance approach. The area within this environmental envelope was used for model development to avoid projecting model predictions outside of the range of the occurrence data. The data set was balanced by randomly subsampling the largest class. Data partitioning A model was developed using the full datasets to demonstrate the "best-case scenario" (BCS), before reducing the size of occurrence dataset in 10% increments at random, to evaluate the impact of sample size on model performance. Models were also fitted to data for the 5 years between 1999 and 2018 with the highest occurrence data sample size to evaluate the impact of dataset on model performance.
Spatial resolution and extent of the raw data: The livestock density data were available at a 10 km resolution.
Bioclimatic data were available at a 1 km resolution. NDVI data were available at a 1 km resolution. Geographic projection: WGS84. Temporal resolution and extent of the raw data: The livestock density data represent predicted livestock density for 2011. Bioclimatic variables and NDVI data were averaged for the temporal extent of the occurrence dataset (1999-2018). Data processing: The extent of all data were clipped to the Italian boundary before processing. Resampling/ aggregation was not performed to standardise resolution.

Model
Variable pre-selection The choice of initial covariates was made as a compromise between availability and ecological/biological relevance to the study species. Only weakly correlated covariates were included in the models.
(Continued on next page)