Modelling age-heterogeneous Schistosoma haematobium and S. mansoni survey data via alignment factors

Background Reliable maps of the geographical distribution, number of infected individuals and burden estimates of schistosomiasis are essential tools to plan, monitor and evaluate control programmes. Large-scale disease mapping and prediction efforts rely on compiled historical survey data obtained from the peer-reviewed literature and unpublished reports. Schistosomiasis surveys usually focus on school-aged children, whereas some surveys include entire communities. However, data are often reported for non-standard age groups or entire study populations. Existing geostatistical models either ignore the age-dependence of disease risk or omit surveys considered too heterogeneous. Methods We developed Bayesian geostatistical models and analysed existing schistosomiasis prevalence data by estimating alignment factors to relate surveys on individuals aged ≤ 20 years with surveys on individuals aged > 20 years and entire communities. Schistosomiasis prevalence data for 11 countries in the eastern African region were extracted from an open-access global database pertaining to neglected tropical diseases. We assumed that alignment factors were constant for the whole region or a specific country. Results Regional alignment factors indicated that the risk of a Schistosoma haematobium infection in individuals aged > 20 years and in entire communities is smaller than in individuals aged ≤ 20 years, with factors of 0.83 and 0.91, respectively. Country-specific alignment factors varied from 0.79 (Ethiopia) to 1.06 (Zambia) for community-based surveys. For S. mansoni, the regional alignment factor for entire communities was 0.96, with country-specific factors ranging from 0.84 (Burundi) to 1.13 (Uganda). Conclusions The proposed approach could be used to align inherent age-heterogeneity between school-based and community-based schistosomiasis surveys to render compiled data for risk mapping and prediction more accurate.


Background
An estimated 200 million individuals are infected with Schistosoma spp. in Africa, and yet schistosomiasis is often neglected [1]. The global strategy to control schistosomiasis and several other neglected tropical diseases (NTDs) is the repeated large-scale administration of anthelminthic drugs to at-risk populations, an approach termed 'preventive chemotherapy' [2,3]. The design, implementation, monitoring and evaluation of schistosomiasis control activities require knowledge of the geographical distribution, number of infected people and disease burden at high spatial resolution.
In the absence of contemporary surveys, large-scale empirical risk mapping relies heavily on analyses of historical survey data. For example, Brooker et al. [4] compiled survey data and presented schistosomiasis (and soil-transmitted helminthiasis) risk maps within the global atlas of helminth infections (GAHI) project (http://www.thiswormyworld.org/). The GAHI database, however, is not fully open-access, and country-specific predictive risk maps only show probabilities of infection prevalence below and above pre-set thresholds where preventive chemotherapy is warranted (e.g. > 50% of school-aged children infected, which demands annual deworming of all school-aged children and adults considered to be at risk) [2]. Starting in late 2006, the European Union (EU)-funded CONTRAST project developed a global database pertaining to NTDs, the GNTD database (http://www.gntd.org) [5]. This open-access database compiles raw survey data from published (i.e. peer-reviewed literature) and unpublished sources (e.g. Ministry of Health reports). It is continuously updated and data can be downloaded as soon as they are entered in the database. In early 2011, the GNTD database consisted of more than 12,000 survey locations for schistosomiasis in Africa [5]. The database has already been utilised for high-spatial resolution schistosomiasis risk mapping and prediction in West Africa [6] and East/southern Africa.
An important drawback of data compilation is the lack of homogeneity and comparability between surveys with respect to target population (different age groups), time of survey and diagnostic method employed, among other issues. The GNTD database is populated with schistosomiasis prevalence surveys conducted in schools, as well as in entire communities, involving different, sometimes overlapping age groups [5]. However, each population subgroup carries a different risk of infection, with school-aged children and adolescents known to carry the highest risk of infection [7,8]. Simple pooling of these types of studies is likely to result in incorrect disease risk estimates.
Schistosomiasis survey data are correlated in space because disease transmission is driven by environmental factors [9-11]. However, standard statistical modelling approaches assume independence between locations, which could result in inaccurate model estimates [12]. Geostatistical models take into account potential spatial clustering by introducing location-specific random effects and are estimated using Markov chain Monte Carlo (MCMC) simulations [13]. Geostatistical models have been applied to compiled survey data for disease risk prediction, for example in malaria [14-16] and helminth infections, including schistosomiasis [6,17].
Age-heterogeneity of survey data has been addressed in geostatistical modelling by omitting those surveys which consist of particularly heterogeneous age-groups [6,15]. As a result, the number of survey locations included in the analysis is reduced, and hence model accuracy is lowered, especially in regions with sparse data. Gemperli et al. [18] used mathematical transmission models to convert age-heterogeneous malaria prevalence data to a common age-independent malaria transmission measure. This approach has been further developed by Gosoniu [19] and Hay et al. [16]. To our knowledge, the age-heterogeneity problem has yet to be investigated in schistosomiasis.
In this paper, we developed Bayesian geostatistical models, which take into account age-heterogeneity by incorporating alignment factors to relate schistosomiasis prevalence data from surveys on individuals aged ≤ 20 years with surveys on individuals > 20 years and entire communities. Different models were implemented assuming regional and country-specific alignment factors. The predictive performance of the models was assessed using a suite of model validation approaches. Our analysis is stratified for Schistosoma haematobium and S. mansoni with a geographical focus on eastern Africa.

Disease data
Prevalence data of S. haematobium and S. mansoni from 11 countries in eastern Africa were extracted from the GNTD database. We excluded non-direct diagnostic examination techniques, such as immunofluorescence tests, antigen detection or questionnaire data. Hospital-based studies and data on non-representative groups, such as HIV positives, are not part of the GNTD database [5].
The remaining data were split into three groups and stratified for the two Schistosoma species according to study type. The three groups correspond to surveys on (i) individuals aged ≤ 20 years, (ii) individuals > 20 years and (iii) entire community surveys. In case a survey contained prevalence data on multiple age groups, we separated the data according to groups (i) and (ii).
Preliminary analyses suggested only weak temporal correlation in the data for either Schistosoma species. Hence, spatial models instead of spatio-temporal models were fitted in the subsequent analyses employing the study year only as a covariate. We grouped the study years as follows: surveys conducted (i) before 1980; (ii) between 1980 and 1989; (iii) between 1990 and 1999; and (iv) from 2000 onwards.
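The grouping of study years into four periods can be written as a simple lookup. The following Python sketch is illustrative only; the function name `survey_period` and the category labels are assumptions and not part of the original analysis.

```python
def survey_period(year: int) -> str:
    """Map a survey year to one of the four period categories used as a covariate.

    The labels are hypothetical; only the cut-off years come from the text.
    """
    if year < 1980:
        return "before 1980"
    if year <= 1989:
        return "1980-1989"
    if year <= 1999:
        return "1990-1999"
    return "2000 onwards"


# Example: a survey from 1987 falls into the second period.
period_1987 = survey_period(1987)
```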

Environmental data
Freely accessible remote sensing data on climatic and other environmental factors were obtained from different sources, as shown in Table 1. Data with temporal variation were obtained from launch until the end of 2009 and summarised as overall averages for the available period. Estimates for day and night temperature were extracted from land surface temperature (LST) data. The normalized difference vegetation index (NDVI) was used as a proxy for vegetation. Land cover categories were restructured into six categories: (i) shrublands and savannah; (ii) forested areas; (iii) grasslands; (iv) croplands; (v) urbanized areas; and (vi) wet areas. Digitized maps of rivers and lakes were combined as a single freshwater map covering the study area.
Characteristics on perennial and seasonal water bodies at each survey location were obtained using the spatial join function of ArcMap version 9.2. In addition, the minimum distance between the locations and the closest freshwater source was calculated with the same function.
All data were used as covariates for modelling. Continuous covariates were categorized based on quartiles in order to account for potential non-linear outcome-predictor relations. Processing and extraction of the climatic and environmental data at the survey locations was performed in ArcMap version 9.2, IDRISI 32 and the MODIS Reprojection Tool.
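Categorizing a continuous covariate by quartiles can be sketched as follows. This is a minimal Python illustration, not the software used in the study; the function name and the handling of values that fall exactly on a cut point (assigned to the higher category) are assumptions, as is the default quantile method of `statistics.quantiles`.

```python
from bisect import bisect_right
from statistics import quantiles


def quartile_category(values):
    """Assign each value to quartile category 1-4 based on the sample quartiles."""
    cuts = quantiles(values, n=4)  # three cut points: Q1, median, Q3
    # bisect_right counts how many cut points are <= value, giving 0..3.
    return [bisect_right(cuts, v) + 1 for v in values]


# Hypothetical covariate values (e.g. a vegetation index at eight locations).
categories = quartile_category([1, 2, 3, 4, 5, 6, 7, 8])
```

Treating the resulting categories as a factor in the regression lets each quartile have its own coefficient, which is how a non-linear relation with the outcome can be captured.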

Geostatistical model formulation and age-alignment
Let $Y_i$ be the number of infected individuals and $N_i$ the number of individuals screened at location $i$ ($i = 1, \ldots, n$). We assumed that $Y_i$ arises from a binomial distribution, i.e. $Y_i \sim \mathrm{Bin}(p_i, N_i)$, with probability of infection $p_i$. We introduced covariates $X_i$ on the logit scale, such that $\mathrm{logit}(p_i) = X_i^T \beta$, where $\beta$ is the vector of regression coefficients. Unobserved spatial variation can be modelled via additional location-specific random effects, $\phi_i$. We assumed that $\phi = (\phi_1, \ldots, \phi_n)^T$ arises from a latent stationary Gaussian spatial process, $\phi \sim \mathrm{MVN}(0, \sigma^2 R)$, with correlation matrix $R$ modelling geographical dependence between any pair of locations $i$ and $j$ via an isotropic exponential correlation function, defined by $R_{ij} = \exp(-\rho d_{ij})$, where $d_{ij}$ is the distance between $i$ and $j$, $\rho$ a correlation decay parameter and $\sigma^2$ the spatial variance. A measurement error can also be introduced via location-specific non-spatial random effects, $\varepsilon_i$, such that $\varepsilon_i \sim N(0, \tau^2)$, with non-spatial variance $\tau^2$.
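The exponential correlation structure can be illustrated with a few lines of Python. This is a sketch under stated assumptions: the coordinates and the value of the decay parameter are hypothetical, and distances are taken as planar Euclidean distances, which the text does not specify.

```python
import math


def exponential_correlation(coords, rho):
    """Return the matrix R with R[i][j] = exp(-rho * d_ij).

    ASSUMPTION: d_ij is the Euclidean distance between planar coordinates.
    """
    return [[math.exp(-rho * math.dist(a, b)) for b in coords] for a in coords]


def inverse_logit(x):
    """Map a linear predictor on the logit scale back to a probability."""
    return 1.0 / (1.0 + math.exp(-x))


coords = [(0.0, 0.0), (3.0, 4.0), (6.0, 8.0)]   # hypothetical survey locations
R = exponential_correlation(coords, rho=0.1)    # hypothetical decay parameter
```

The correlation equals 1 on the diagonal and decays with distance, so nearby locations share more of the spatial random effect than distant ones; multiplying `R` by the spatial variance gives the covariance matrix of the latent process.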
We aligned the risk measured by the different types of studies by incorporating a factor $\alpha_s$ such that $Y_{i,s} \sim \mathrm{Bin}(q_{i,s}, N_{i,s})$, with $q_{i,s} = \alpha_s p_i$ and $s = 1$ (surveys with individuals aged ≤ 20 years); $s = 2$ (surveys with individuals aged > 20 years); and $s = 3$ (entire community surveys). School-aged children carry the highest risk of Schistosoma infection, and hence many studies focus on this age group. We set $\alpha_1 = 1$ in order to use the probability of infection for individuals aged ≤ 20 years as baseline and to align the other groups to this designated baseline.
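The effect of an alignment factor on the binomial likelihood can be sketched in Python as follows. The factor values are the regional S. haematobium estimates reported in the Results and are used purely for illustration; the function names and the rejection of scaled risks outside (0, 1) are assumptions of this sketch, not part of the fitted OpenBUGS model.

```python
import math

# Survey type s: 1 = aged <= 20 years (baseline), 2 = aged > 20 years, 3 = community.
ALPHA = {1: 1.0, 2: 0.83, 3: 0.91}  # illustrative values taken from the Results


def aligned_risk(p, s, alpha=ALPHA):
    """Return q = alpha_s * p, the risk seen by a survey of type s."""
    q = alpha[s] * p
    if not 0.0 < q < 1.0:
        # ASSUMPTION: the sketch simply rejects values outside (0, 1).
        raise ValueError("aligned risk must lie strictly between 0 and 1")
    return q


def binomial_loglik(y, n, q):
    """Log-likelihood of y infected out of n screened with infection risk q."""
    log_choose = math.lgamma(n + 1) - math.lgamma(y + 1) - math.lgamma(n - y + 1)
    return log_choose + y * math.log(q) + (n - y) * math.log(1.0 - q)


# Baseline risk of 0.4 as seen by an adult survey with 30 of 100 positive.
q_adult = aligned_risk(0.4, 2)
loglik = binomial_loglik(30, 100, q_adult)
```

Because the factor multiplies the prevalence directly, a value of 0.83 reads as "the adult survey is expected to show 83% of the prevalence that a survey of individuals aged ≤ 20 years would show at the same location".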
To complete the Bayesian model formulation, we assumed non-informative priors for all parameters. Normal prior distributions with mean 0 and large variance were used for the regression coefficients, $\beta$. Non-informative Gamma distributions with mean 1 were assumed for the variance parameters, $\sigma^2$ and $\tau^2$, and the alignment factors $\alpha_s$, while a uniform distribution was implemented for the spatial decay parameter $\rho$.
Models were developed in OpenBUGS version 3.0.2 (OpenBUGS Foundation; London, UK) and run with two chains and a burn-in of 5000 iterations. Convergence was assessed by inspection of ergodic averages of selected model parameters and history plots. After convergence, samples of 500 iterations per chain with a thinning of 10 were extracted for each model resulting in a final sample of 1000 estimates per parameter.

Model types
We implemented four different models, separately for S. haematobium and S. mansoni. The models differed in two features. The first feature was the underlying data. Model A only consisted of schistosomiasis prevalence data on individuals aged ≤ 20 years (s = 1), while models B-D included data on all three kinds of study types (s = 1,2,3). The second feature was the introduction of alignment factors for disease risk modelling. Model B pooled all survey types without alignment factors, model C assumed common alignment factors across the entire study region, and model D assumed country-specific alignment factors.

Model validation
Validation for each model was carried out to identify the model with the highest predictive ability for either Schistosoma species and to compare models with and without alignment factors. All models were fitted on a subset of the data (training set) and validated by comparing the posterior median of the predicted risk $p_j^*$ with the observed risk $P_j$ for the remaining data (test set, $j = 1, \ldots, m$, $m < n$). The test set consisted of 20% of the locations from the dataset on individuals aged ≤ 20 years and was identical across all models.
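Holding out 20% of the locations as a test set can be sketched in Python as below. How the test locations were actually selected is not stated in the text; the random selection, the seed and the function name are assumptions made for illustration. Reusing the same seed reproduces the same test set, which keeps it identical across models.

```python
import random


def split_locations(location_ids, test_fraction=0.2, seed=1):
    """Split location identifiers into a training list and a test list.

    ASSUMPTION: test locations are drawn at random; a fixed seed makes the
    split reproducible so that every model is validated on the same set.
    """
    rng = random.Random(seed)
    ids = list(location_ids)
    n_test = round(test_fraction * len(ids))
    test = set(rng.sample(ids, n_test))
    train = [i for i in ids if i not in test]
    return train, sorted(test)


# Hypothetical example with 50 locations surveyed in individuals aged <= 20 years.
train_ids, test_ids = split_locations(range(50))
```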
Comparisons of predicted vs. observed risk were based on three different validation approaches. The mean absolute error (MAE) is the average absolute difference between observed and predicted schistosomiasis risk, $\mathrm{MAE} = \frac{1}{m} \sum_{j=1}^{m} |P_j - p_j^*|$. An alternative way to quantify the divergence of the predictions from the observed data is the $\chi^2$ measure. The best predicting model based on these two methods is the model with the smallest MAE and $\chi^2$ estimates, and therefore with predictions closest to the observed values.
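The two summary measures can be sketched in Python as follows. The MAE follows the usual definition; the exact form of the χ² measure is not recoverable from the text, so the Pearson-type form used here (squared difference divided by the predicted risk) is an assumption, as are the function names and the example values.

```python
def mean_absolute_error(observed, predicted):
    """Average absolute difference between observed and predicted risk."""
    return sum(abs(o - p) for o, p in zip(observed, predicted)) / len(observed)


def chi_square_measure(observed, predicted):
    """Pearson-type discrepancy between observed and predicted risk.

    ASSUMPTION: sum over test locations of (observed - predicted)^2 / predicted;
    the source text does not give the formula.
    """
    return sum((o - p) ** 2 / p for o, p in zip(observed, predicted))


# Hypothetical observed prevalences and posterior median predictions.
observed = [0.2, 0.4]
predicted = [0.1, 0.5]
mae = mean_absolute_error(observed, predicted)
chi2 = chi_square_measure(observed, predicted)
```

Smaller values of either measure indicate predictions closer to the observed prevalence.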
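The proportion of the test data correctly predicted within the q-th Bayesian credible interval ($\mathrm{BCI}_q$) of the posterior predictive distribution is calculated by $\frac{1}{m} \sum_{j=1}^{m} I\left(c_j^{l(q)} \le P_j \le c_j^{u(q)}\right)$, where $c_j^{l(q)}$ and $c_j^{u(q)}$ denote the lower and upper limits of $\mathrm{BCI}_q$ at test location $j$ and $I(\cdot)$ is the indicator function.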
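A minimal Python sketch of this coverage proportion is given below. It assumes that posterior predictive draws are available for each test location and that the credible interval is equal-tailed; the function names, the linear-interpolation quantile and the interval convention are assumptions of the sketch.

```python
import math


def quantile(sorted_values, prob):
    """Empirical quantile with linear interpolation between order statistics."""
    pos = prob * (len(sorted_values) - 1)
    lo, hi = math.floor(pos), math.ceil(pos)
    frac = pos - lo
    return sorted_values[lo] * (1 - frac) + sorted_values[hi] * frac


def credible_interval(samples, level):
    """Equal-tailed credible interval (ASSUMPTION) at the given level, e.g. 0.95."""
    s = sorted(samples)
    tail = (1 - level) / 2
    return quantile(s, tail), quantile(s, 1 - tail)


def bci_coverage(observed, posterior_samples, level):
    """Proportion of test locations whose observed risk lies inside the interval."""
    hits = 0
    for obs, samples in zip(observed, posterior_samples):
        lower, upper = credible_interval(samples, level)
        hits += lower <= obs <= upper
    return hits / len(observed)
```

Reporting the average interval width alongside the coverage helps to distinguish a model that covers many test locations because it predicts well from one that does so only because its intervals are wide.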

Results
Schistosomiasis prevalence data
Figure 1 shows the distribution of the observed schistosomiasis prevalence data over the study region, stratified by study type. An overview of the amount of observed data and mean prevalence levels per country for either Schistosoma species, stratified by survey period and diagnostic methods, is given in Table 2. Some countries (e.g. Kenya and Tanzania) contain large numbers of survey locations, while other countries, such as Burundi, Eritrea, Rwanda, Somalia and Sudan, are not well covered. Burundi and Rwanda do not include any locations for S. haematobium, and Rwanda contains only four surveys on individuals aged > 20 years for S. mansoni. As expected, more surveys were carried out with individuals aged ≤ 20 years than with adult populations or entire communities. The mean prevalence per country for surveys on individuals aged ≤ 20 years varies between 0% (Eritrea) and 53.9% (Malawi) for S. haematobium and between 0% (Somalia) and 61.6% (Sudan) for S. mansoni. We found an overall mean prevalence of S. haematobium and S. mansoni of 32.8% and 23.2%, respectively. Community surveys usually showed higher mean prevalence levels. However, the survey locations might not be the same among the different types of studies and therefore the observed prevalence levels are not directly comparable.
Two-thirds of the S. haematobium survey data were obtained before the 1990s (66.5%), while few surveys were compiled from 2000 onwards (16.2%). On the other hand, S. mansoni surveys were mainly conducted in the 1980s (32.7%) and from 2000 onwards (29.8%), whereas only 15.9% of the surveys were carried out in the 1990s. The distribution of surveys within the different time periods varies from country to country and between the two Schistosoma species. While some countries (e.g. Eritrea and Somalia) only have surveys for one or two periods, other countries (e.g. Kenya, Tanzania and Zambia) are well covered over time.
The data also vary in the diagnostic methods. For example, even though 67.4% of the S. mansoni surveys with known diagnostic methods employed the Kato-Katz thick smear method, in Somalia and Eritrea only stool concentration methods (e.g. Ritchie technique or ether-concentration technique) were used.

Model validation
For S. haematobium, model validation based on the MAE measure (Table 3) showed no difference between disease risk modelling on individuals aged ≤ 20 years (model A) and unaligned modelling of all three survey types (model B), while the $\chi^2$ measure indicated improved predictions for model B. The introduction of regional alignment factors in spatial modelling based on all survey types (model C) further enhanced model predictive ability based on the MAE and $\chi^2$ measures. Model D, including country-specific alignment factors, showed predictive performance similar to that of model B. Validation based on different BCIs demonstrated that the proportion of correctly predicted test locations was similar among all models. Model A predicted most test locations correctly within the 95% BCI, while model C was superior for 50% BCIs and model D for 70% BCIs. Regardless of the model used, average BCI widths were comparable.
For S. mansoni, model predictive performance in terms of MAE and $\chi^2$ measures was best for model C, followed by models B and D. The differences among the models for the BCI method were small and not consistent between the examined BCIs. For example, at the 70% BCI, model A included the fewest test locations, while at the 95% BCI, this model correctly predicted most of the test locations but had the widest average BCI.

Alignment factors
Regional and country-specific schistosomiasis risk alignment factors for S. haematobium and S. mansoni are presented in Table 4. Some countries had insufficient data, and hence country-wide alignment factors could not be estimated. A mean regional alignment factor of 0.83 (95% BCI: 0.81-0.85) confirmed that the risk of S. haematobium in individuals aged ≤ 20 years is greater than in individuals aged > 20 years. The S. haematobium risk estimated from entire community surveys was related to the risk in individuals aged ≤ 20 years by a factor of 0.91 (95% BCI: 0.90-0.93). Mean country-specific alignment factors varied from 0.62 (Ethiopia) to 1.26 (Zambia) among individuals aged > 20 years and from 0.79 (Ethiopia) to 1.06 (Zambia) in entire communities. In Ethiopia and Sudan, the country-specific alignment factors were significantly smaller than the overall alignment factor, whereas in Somalia and Zambia, country-specific factors were significantly larger. For S. mansoni, the mean regional alignment factor among individuals aged > 20 years was 0.94 (95% BCI: 0.92-0.96), while country-specific estimates varied from 0.64 (Zambia) to 1.18 (Tanzania). In community surveys, the regional alignment factor was 0.96 (95% BCI: 0.95-0.98), with country-specific alignment factors between 0.84 (Burundi) and 1.13 (Uganda). Significantly smaller country-specific alignment factors compared to the overall alignment factor were found in Burundi, Ethiopia and Zambia, while significantly larger factors were obtained for Kenya, Tanzania and Uganda.
The regional alignment factor estimates for S. haematobium are much lower than those for S. mansoni, e.g. a 17% risk reduction for individuals aged > 20 years for S. haematobium vs. a 6% risk reduction for S. mansoni. This relation is also found in country-specific estimates, except for Zambia.
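The quoted risk reductions follow directly from the regional alignment factors for individuals aged > 20 years (0.83 for S. haematobium and 0.94 for S. mansoni), since an alignment factor α corresponds to a relative risk reduction of 1 − α. A short Python sketch of this arithmetic, with a hypothetical function name:

```python
def risk_reduction_percent(alignment_factor):
    """Relative risk reduction (in percent) implied by an alignment factor."""
    return (1.0 - alignment_factor) * 100.0


# Regional alignment factors for individuals aged > 20 years, as reported above.
haematobium_reduction = round(risk_reduction_percent(0.83))
mansoni_reduction = round(risk_reduction_percent(0.94))
```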

Discussion
In this study, we derived factors to align schistosomiasis prevalence estimates from age-heterogeneous surveys across an ensemble of 11 countries in eastern Africa. We found correction factors that are significantly different from 1. As a result, geostatistical model-based predictions from school-based and community-based surveys are further enhanced. The estimates of the regional alignment factors confirm that individuals aged ≤ 20 years are at a higher risk of a Schistosoma infection than adults [7,8,20]. Interestingly, the alignment factor estimates for S. haematobium were slightly lower than those for S. mansoni. This finding might be explained by differences in the age-prevalence curves between the two species. S. haematobium prevalence usually peaks in the age group 10-15 years [21], while the peak of S. mansoni prevalence occurs somewhat later, up to the age of 20 years [22]. Consequently, there is a larger difference in infection risk between children and adults for S. haematobium compared to S. mansoni. Additionally, the peak of S. mansoni prevalence might be further shifted towards older age groups due to the so-called peak shift. Indeed, it has been shown that the peak of infection prevalence is flatter and reaches its maximum in older age groups if transmission is low-to-moderate, while prevalence peaks are higher and are observed at a younger mean age if transmission is high [7]. Several African countries have implemented large-scale preventive chemotherapy programmes against schistosomiasis [3,23]. These programmes reduced schistosomiasis-related morbidity [24] and might have had some impact on transmission [25,26]. It is therefore conceivable that the peak of Schistosoma infection might slightly shift to older age groups. It should also be noted that disparities in the spatial risk distribution of the two Schistosoma species and in the implementation of control strategies in these areas could have led to differences in the alignment factors.
Considerable differences between country-specific alignment factors and prevalence ratios based on the raw data were found for Ethiopia, Tanzania, Uganda and Zambia in S. haematobium, and for Burundi and Zambia in S. mansoni. These differences are mainly due to the spatial distribution of the survey locations, which varies between age groups. For example, surveys focussing on individuals aged ≤ 20 years are located in central and eastern Zambia, while surveys on individuals aged > 20 years in Zambia are mainly located in the north of the country. The north is characterised by lower schistosomiasis transmission risk. Therefore, the crude prevalence ratio between the two groups is artificially small, while the alignment factor, which is based on the predicted prevalence risk in this area, is much higher.
Model validation showed that regional alignment factors improved the predictive performance of the models for both Schistosoma species, whereas country-specific alignment factors did not further improve the models. The predictive performance of the model with regional factors was good, as 79.4% and 83.8% of the test locations were correctly predicted within 95% BCIs for S. haematobium and S. mansoni, respectively. All models estimated relatively wide BCIs, indicating large variation in the data that could not be explained by the model covariates. Socioeconomic and health system factors might play a role in the spatial distribution of schistosomiasis; however, these data do not exist at high spatial resolution for the entire study area, and hence could not be used for model fit and prediction. Part of the variation might have arisen from the model assumptions of stationarity and isotropy and from the heterogeneity in the diagnostic methods.
The proposed alignment factor approach scales the predicted prevalence of schistosomiasis and leads to an easy interpretation of the parameters. In addition, it allows meaningful prior distributions to be defined, and hence results in better model convergence. An alternative way to include age in the models is to introduce age as a covariate. This approach scales the odds instead of the prevalence. Preliminary analyses performed by the authors on the same data using age as a covariate resulted in serious model convergence problems, leading to the implementation of age alignment factors as proposed in this manuscript.
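The difference between scaling the prevalence and scaling the odds can be made concrete with a small Python sketch. The baseline risk of 0.4 and the factor of 0.83 are illustrative values, and the function names are hypothetical; applying the same number on the odds scale is done here only to show that the two parameterisations are not interchangeable.

```python
def scale_prevalence(p, factor):
    """Alignment-factor approach: the factor multiplies the prevalence itself."""
    return factor * p


def scale_odds(p, odds_ratio):
    """Covariate-on-the-logit-scale approach: the factor multiplies the odds."""
    odds = p / (1.0 - p) * odds_ratio
    return odds / (1.0 + odds)


baseline = 0.4                                   # hypothetical risk, aged <= 20 years
via_prevalence = scale_prevalence(baseline, 0.83)
via_odds = scale_odds(baseline, 0.83)
```

With these values, scaling the prevalence gives a risk of about 0.33, whereas scaling the odds by the same number gives about 0.36, so a coefficient on the logit scale cannot be read directly as a ratio of prevalences.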
A limitation of our work is the assumption of constant disease risk within each age group. This is not true especially for school-aged children for whom the schistosomiasis risk reaches a maximum at around 11-14 years. A more rigorous model formulation should take into account the age-prevalence curve and standardise the surveys using a mathematical description of this curve. Raso et al. [27] derived a Bayesian formulation of the immigration-death model to obtain age-specific prevalence of S. mansoni from age-prevalence curves. We are currently exploring geostatistical models, coupled with mathematical immigration-death models, to fully consider the age-dependence of the schistosomiasis risk.

Conclusions
We have shown that age-alignment factors should be included to improve prevalence estimates of population-based risk of schistosomiasis, especially for large-scale modelling and prediction efforts. Indeed, large-scale modelling cannot be achieved without compilation of primarily historical survey data assembled over large study areas using different study designs and age groups. The proposed alignment factor approach can be used to relate the most frequent survey types, i.e. studies focussing on individuals aged ≤ 20 years (mainly school surveys), with studies on individuals aged > 20 years and entire communities. Unaligned survey compilation leads to imprecise disease risk estimates and potentially wrong recommendations to decision makers for the implementation of control activities and subsequent monitoring and evaluation.