Bayesian geostatistical modelling of soil-transmitted helminth survey data in the People’s Republic of China

Background: Soil-transmitted helminth infections affect tens of millions of individuals in the People’s Republic of China (P.R. China). There is a need for high-resolution estimates of at-risk areas and number of people infected to enhance spatial targeting of control interventions. However, such information is not yet available for P.R. China.
Methods: A geo-referenced database compiling surveys pertaining to soil-transmitted helminthiasis, carried out from 2000 onwards in P.R. China, was established. Bayesian geostatistical models relating the observed survey data with potential climatic, environmental and socioeconomic predictors were developed and used to predict at-risk areas at high spatial resolution. Predictors were extracted from remote sensing and other readily accessible open-source databases. Advanced Bayesian variable selection methods were employed to develop a parsimonious model.
Results: Our results indicate that the prevalence of soil-transmitted helminth infections in P.R. China considerably decreased from 2005 onwards. Yet, some 144 million people were estimated to be infected in 2010. High prevalence (>20%) of the roundworm Ascaris lumbricoides infection was predicted for large areas of Guizhou province, the southern part of Hubei and Sichuan provinces, while the northern part and the south-eastern coastal areas of P.R. China had low prevalence (<5%). High infection prevalence (>20%) with hookworm was found in Hainan, the eastern part of Sichuan and the southern part of Yunnan provinces. High infection prevalence (>20%) with the whipworm Trichuris trichiura was found in a few small areas of south P.R. China. Very low prevalence (<0.1%) of hookworm and whipworm infections was predicted for the northern parts of P.R. China.
Conclusions: We present the first model-based estimates for soil-transmitted helminth infections throughout P.R. China at high spatial resolution. Our prediction maps provide useful information for the spatial targeting of soil-transmitted helminthiasis control interventions and for long-term monitoring and surveillance in the frame of enhanced efforts to control and eliminate the public health burden of these parasitic worm infections.


Background
Soil-transmitted helminths are a group of parasitic nematode worms causing human infection through contact with parasite eggs (Ascaris lumbricoides and Trichuris trichiura) or larvae (hookworm) that thrive in the warm and moist soil of the world's tropical and subtropical countries [1]. More than 5 billion people are at risk of soil-transmitted helminthiasis [2]. Estimates published in 2003 suggest that 1,221 million people were infected with A. lumbricoides, 795 million with T. trichiura and 740 million with hookworms [3]. The greatest number of soil-transmitted helminth infections at that time occurred in the Americas, the People's Republic of China (P.R. China), East Asia and sub-Saharan Africa [4]. Socioeconomic development and large-scale control efforts have lowered the number of people infected with soil-transmitted helminths in many parts of the world [1]. For the year 2010, the global burden due to soil-transmitted helminthiasis has been estimated at 5.2 million disability-adjusted life years [5].
In P.R. China, there have been two national surveys for parasitic diseases, including soil-transmitted helminthiasis. Both surveys used the Kato-Katz technique as the diagnostic approach, based on a single Kato-Katz thick smear obtained from one stool sample per individual. The first national survey was conducted from 1988 to 1992 and the second in 2001-2004. In the first survey, there were a total of 2,848 study sites with approximately 500 people examined per site. The survey indicated overall prevalences of 47.0%, 18.8% and 17.2% for A. lumbricoides, T. trichiura and hookworm infections, respectively, corresponding to 531 million, 212 million and 194 million infected people, respectively [6]. The second survey involved 687 study sites and there were 356,629 individuals examined overall. Analyses of the data revealed considerably lower prevalences for soil-transmitted helminth infections than in the first survey; A. lumbricoides, hookworm and T. trichiura prevalences were 12.7%, 6.1% and 4.6%, respectively [7]. However, interventions were less likely to reach marginalized communities in the poorest areas [8] and the diseases re-emerged whenever control measures were discontinued [9,10]. To overcome the challenge of parasite infections in P.R. China, in 2005, the Chinese Ministry of Health issued the "National Control Program on Important Parasitic Diseases from 2006 to 2015" with its target to reduce the prevalence of helminth infections by 70% by the year 2015 [8]. The key strategy for control was large-scale administration of anthelminthic drugs in high prevalence areas, especially targeting school-aged children and people living in rural areas [9,11].
Maps depicting the geographical distribution of the disease risk can aid control programmes to deliver cost-effective interventions and assist in monitoring and evaluation. The Coordinating Office of the National Survey on the Important Human Parasitic Diseases in P.R. China [7] obtained prevalence maps by averaging the data of the second national survey within each province. To our knowledge, high-resolution, model-based maps using available national survey data are not available to date in P.R. China. Model-based geostatistics predict the disease prevalence at places without observed data by quantifying the relation between the disease risk at observed locations and potential predictors such as socioeconomic, environmental, climatic and ecological information, the latter often obtained via remote sensing. Model-based geostatistics have been used before to map and predict the geographical distribution of soil-transmitted helminth infections in Africa [12,13], Asia and Latin America [14-16]. Model-based geostatistics typically employ regression analysis with random effects at the locations of the observed data. The random effects are assumed to be latent observations from a zero-mean Gaussian process, which introduces spatial correlation into the data via a spatially structured covariance. Bayesian formulations enable model fit via Markov chain Monte Carlo (MCMC) simulation algorithms [17,18] or other computational algorithms (e.g. integrated nested Laplace approximations (INLA) [19]). INLA is a computational approach for Bayesian inference that offers an alternative to MCMC, reducing the computational burden of obtaining approximate posterior marginal distributions for the latent variables and for the hyperparameters [20].
In this study, we aimed to: (i) identify the most important climatic, environmental and socioeconomic determinants of soil-transmitted helminth infections; and (ii) develop model-based Bayesian geostatistics to assess the geographical distribution and number of people infected with soil-transmitted helminths in P.R. China.

Ethical considerations
The work presented here is based on soil-transmitted helminth survey data derived from the second national survey and additional studies identified through an extensive review of the literature. All data in our study were extracted from published sources and are aggregated over villages, towns or counties; therefore, they do not contain information that is identifiable at the individual or household level. Hence, there are no specific ethical considerations.

Disease data
Geo-referenced data on soil-transmitted helminth infections were obtained from the second national survey conducted in P.R. China [21]. Data were entered into the Global Neglected Tropical Diseases (GNTD) database, which is a geo-referenced, open-access source [21]. Geographical coordinates for the survey locations were obtained via Google Maps, a free web mapping service application and technology system. As we focus on recent data pertaining to soil-transmitted helminth infections in P.R. China, we only considered surveys carried out from 2000 onwards.

Climatic, demographic and environmental data
Climatic, demographic and environmental data were downloaded from different readily accessible remote sensing data sources, as shown in Table 1. Land surface temperature (LST) and normalized difference vegetation index (NDVI) were averaged annually and land cover data were summarised to the most frequent category over the period 2001-2004. Moreover, land cover data were re-grouped into six categories based on between-class similarities: (i) forest; (ii) shrubland and savanna; (iii) grassland; (iv) cropland; (v) urban; and (vi) wet areas. Monthly precipitation values were averaged to obtain a long-term average for the period 1950-2000. Four climatic zones were considered: (i) equatorial; (ii) arid; (iii) warm; and (iv) snow/polar. The following 13 soil types, which may be related to the viability of parasites or microorganisms living in the soil, were used: (i) percentage of coarse fragments (CFRAG, % >2 mm); (ii) percentage of sand (SDTO, mass %); (iii) percentage of silt (STPC, mass %); (iv) percentage of clay (CLPC, mass %); (v) bulk density (BULK, kg/dm³); (vi) available water capacity (TAWC, cm/m); (vii) base saturation as percentage of ECEsoil (BSAT); (viii) pH measured in water (PHAQ); (ix) gypsum content (GYPS, g/kg); (x) organic carbon content (TOTC, g/kg); (xi) total nitrogen (TOTN, g/kg); (xii) FAO texture class (PSCL); and (xiii) FAO soil drainage class (DRAIN). Human influence index (HII) was included in the analysis to capture direct human influence on ecosystems [22]. Urban/rural extent was considered as a binary indicator. Gross domestic product (GDP) per capita was used as a proxy of people's socioeconomic status. We obtained GDP per capita for each county from the P.R. China Yearbook full-text database in 2008.
Moderate Resolution Imaging Spectroradiometer (MODIS) Reprojection Tool version 4.1 (EROS; Sioux Falls, USA) was applied to process MODIS/Terra data. All remotely sensed data were aligned over a prediction grid of 5 × 5 km spatial resolution using Visual Fortran version 6.0 (Digital Equipment Corporation; Maynard, USA). Data at the survey locations were also extracted in Visual Fortran. As the outcome of interest (i.e. infection prevalence with a specific soil-transmitted helminth species) is not available at the resolution of the covariates for surveys aggregated over counties, we linked the centroid of those counties with the average value of each covariate within the counties. Distances to the nearest water bodies were calculated using ArcGIS version 9.3 (ESRI; Redlands, USA). For county-level surveys, the distances of all the 5 × 5 km pixel centroids to their nearest water bodies within the county were extracted and averaged. The arithmetic mean was used as a summary measure of continuous data, while the most frequent category was used to summarise categorical variables.
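The county-level summary rule described above (arithmetic mean for continuous covariates, most frequent category for categorical ones) can be sketched as follows. This is only an illustration in Python, not the Visual Fortran or ArcGIS workflow used in the study; the function name, the covariate names and the pixel values are hypothetical.

```python
from collections import Counter
from statistics import mean


def summarise_county(pixels, continuous, categorical):
    """Summarise pixel-level covariates for one county.

    Continuous covariates are reduced to their arithmetic mean,
    categorical covariates to their most frequent category.
    """
    summary = {}
    for name in continuous:
        summary[name] = mean(p[name] for p in pixels)
    for name in categorical:
        counts = Counter(p[name] for p in pixels)
        # On ties, most_common returns the category encountered first.
        summary[name] = counts.most_common(1)[0][0]
    return summary


# Hypothetical 5 x 5 km pixels inside one county (made-up values).
county_pixels = [
    {"ndvi": 0.40, "dist_water_km": 2.0, "land_cover": "cropland"},
    {"ndvi": 0.50, "dist_water_km": 4.0, "land_cover": "cropland"},
    {"ndvi": 0.60, "dist_water_km": 6.0, "land_cover": "forest"},
]
county_summary = summarise_county(
    county_pixels, ["ndvi", "dist_water_km"], ["land_cover"]
)
print(county_summary)
```

The resulting summary would then be attached to the county centroid, as described for county-level surveys.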

Statistical analysis
The survey year was grouped into two categories: before 2005 and from 2005 onwards. Land cover, climatic zones, soil texture and soil drainage were included in the model as categorical covariates. Continuous variables were standardised to mean 0 and standard deviation 1 using the command "std()" in Stata version 10 (Stata Corp. LP; College Station, USA). Pearson's correlation was calculated between continuous variables. For each pair of variables with a correlation coefficient greater than 0.8, one of the two was dropped to avoid collinearity [23]. Preliminary analysis indicated that, for this dataset, three categories were sufficient to capture non-linearity of continuous variables; therefore, we constructed 3-level categorical variables based on their distribution. Subsequent variable selection incorporated within the geostatistical model selected the most probable functional form (linear vs. categorical). Bivariate and multivariate logistic regressions were carried out in Stata version 10.
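The standardisation and collinearity screening steps can be illustrated with a short Python sketch. It is not the Stata code used in the study. The covariate values are made up, and the rule for which variable of a correlated pair is dropped (here: the one listed later) is an assumption, since the text does not specify it.

```python
from math import sqrt
from statistics import mean, stdev


def standardise(values):
    """Rescale values to mean 0 and (sample) standard deviation 1."""
    m, s = mean(values), stdev(values)
    return [(v - m) / s for v in values]


def pearson(x, y):
    """Pearson's correlation coefficient between two equal-length lists."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def drop_collinear(columns, threshold=0.8):
    """Keep a variable only if |r| <= threshold with every variable kept so far.

    Assumption: of two highly correlated variables, the later one is dropped.
    """
    kept = []
    for name in columns:
        if all(abs(pearson(columns[name], columns[k])) <= threshold for k in kept):
            kept.append(name)
    return kept


# Hypothetical covariate values at five survey locations.
covariates = {
    "lst_day": [20.0, 22.0, 25.0, 27.0, 30.0],
    "lst_night": [10.0, 11.5, 14.0, 15.5, 18.0],  # nearly collinear with lst_day
    "precipitation": [900.0, 400.0, 1200.0, 300.0, 800.0],
}
kept = drop_collinear(covariates)
z = standardise(covariates["lst_day"])
print(kept)
```

In this toy example the night-time temperature is removed because it is almost a linear function of the day-time temperature, while precipitation is retained.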
Bayesian geostatistical logistic regression models with location-specific random effects were fitted to obtain spatially explicit soil-transmitted helminth infection estimates. Let $Y_i$, $n_i$ and $p_i$ be the number of positive individuals, the number of those examined and the probability of infection at location $i$ ($i = 1, 2, \ldots, L$), respectively. We assume that $Y_i$ arises from a binomial distribution, $Y_i \sim \mathrm{Bn}(p_i, n_i)$, where
$$\mathrm{logit}(p_i) = \beta_0 + \sum_{k} \beta_k X_i^{(k)} + \varepsilon_i + \phi_i .$$
Here $\beta_k$ is the regression coefficient of the $k$th covariate $X_i^{(k)}$, $\varepsilon_i$ is a location-specific random effect and $\phi_i$ is an exchangeable non-spatial random effect. To estimate the parameters, we formulate our model in a Bayesian framework. We assumed that $\varepsilon = (\varepsilon_1, \ldots, \varepsilon_L)$ followed a zero-mean multivariate normal distribution, $\varepsilon \sim \mathrm{MVN}(0, \Sigma)$, with Matérn covariance function
$$\Sigma_{ij} = \sigma^2_{sp}\,\frac{(\kappa d_{ij})^{\upsilon} K_{\upsilon}(\kappa d_{ij})}{\Gamma(\upsilon)\, 2^{\upsilon - 1}},$$
where $d_{ij}$ is the Euclidean distance between locations $i$ and $j$, $\kappa$ is a scaling parameter, $\upsilon$ is a smoothing parameter fixed to 1 and $K_{\upsilon}$ denotes the modified Bessel function of the second kind and order $\upsilon$. The spatial range, $\rho = \sqrt{8}/\kappa$, is the distance at which spatial correlation becomes negligible (approximately 0.1) [24]. We assumed that $\phi_i$ follows a zero-mean normal distribution, $\phi_i \sim N(0, \sigma^2_{nonsp})$. A normal prior distribution was assigned to the regression coefficients, that is $\beta_0, \beta_k \sim N(0, 1000)$, and log-gamma priors were adopted for the precision parameters $\tau_{sp} = 1/\sigma^2_{sp}$ and $\tau_{nonsp} = 1/\sigma^2_{nonsp}$ on the log scale, that is $\log(\tau_{sp}) \sim \mathrm{loggamma}(1, 0.00005)$ and $\log(\tau_{nonsp}) \sim \mathrm{loggamma}(1, 0.00005)$.
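As an illustration of the Matérn covariance with the smoothing parameter fixed to 1, the following Python sketch evaluates the covariance as a function of distance and checks the correlation at the spatial range ρ = √8/κ. It is not part of the INLA implementation used in the study. The Bessel function is computed here by a simple numerical integration, chosen only to keep the sketch free of external libraries, and the values of κ and of the spatial variance are hypothetical.

```python
import math


def bessel_k(nu, x, upper=20.0, steps=4000):
    """Modified Bessel function of the second kind, K_nu(x), for x > 0.

    Uses the integral representation
        K_nu(x) = int_0^inf exp(-x * cosh(t)) * cosh(nu * t) dt
    evaluated with the trapezoidal rule on [0, upper]. The truncation is
    adequate for moderate x (roughly x >= 0.01), not for x close to zero.
    """
    h = upper / steps
    total = 0.0
    for j in range(steps + 1):
        t = j * h
        weight = 0.5 if j in (0, steps) else 1.0
        total += weight * math.exp(-x * math.cosh(t)) * math.cosh(nu * t)
    return total * h


def matern_cov(d, sigma2_sp, kappa, nu=1.0):
    """Matern covariance between two locations separated by distance d."""
    if d == 0.0:
        return sigma2_sp  # limit of the covariance as d -> 0
    kd = kappa * d
    return sigma2_sp * kd ** nu * bessel_k(nu, kd) / (math.gamma(nu) * 2.0 ** (nu - 1.0))


kappa = 0.01                   # hypothetical scaling parameter (per km)
rho = math.sqrt(8.0) / kappa   # spatial range, about 283 km here
corr_at_range = matern_cov(rho, 1.0, kappa)
print(round(rho, 1), round(corr_at_range, 3))
```

With unit variance the covariance equals the correlation, and at the range ρ it evaluates to roughly 0.14, i.e. of the order of 0.1, so locations further apart than ρ can be regarded as nearly uncorrelated.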
The most widely used computational approach for Bayesian geostatistical model fit is MCMC simulation. However, large spatial covariance matrix calculations can increase computational time and possibly introduce numerical errors. Hence, we fitted the geostatistical model using the stochastic partial differential equations (SPDE)/INLA [19,25] approach, readily implemented in the INLA R-package (available at: http://www.r-inla.org). Briefly, the spatial process assuming a Matérn covariance matrix Σ can be represented as a Gaussian Markov random field (GMRF) with mean zero and a symmetric positive definite precision matrix Q (defined as the inverse of Σ) [20]. The SPDE approach constructs a GMRF representation of the Matérn field on a triangulation (a set of non-intersecting triangles where any two triangles meet in at most a common edge or corner) partitioning the domain of the study region [25]. Subsequently, the INLA algorithm is used to estimate the posterior marginal (or joint) distribution of the latent Gaussian process and hyperparameters by Laplace approximation [19].
Bayesian variable selection, using normal mixture of inverse Gammas with parameter expansion (peNMIG) spike-and-slab priors [26], was applied to the model with an independent random effect for each location to identify the best set of predictors (i.e. climatic, environmental and socioeconomic). In particular, we assumed a normal distribution for the regression coefficients with a hyperparameter for the variance $\sigma_B^2$ taken to be a mixture of inverse Gamma distributions, that is
$$\beta_k \sim N(0, \sigma_B^2), \qquad \sigma_B^2 \sim I_k\, \mathrm{IG}(a_\sigma, b_\sigma) + (1 - I_k)\, \upsilon_0\, \mathrm{IG}(a_\sigma, b_\sigma),$$
where $a_\sigma$ and $b_\sigma$ are fixed parameters, $\upsilon_0$ is some small positive constant [27] and the indicator $I_k$ has a Bernoulli prior distribution $I_k \sim \mathrm{Bern}(\pi_k)$, where $\pi_k \sim \mathrm{Beta}(a_\pi, b_\pi)$. We set $(a_\sigma, b_\sigma) = (5, 25)$, $(a_\pi, b_\pi) = (1, 1)$ and $\upsilon_0 = 0.00025$. The above prior of mixed inverse Gamma distributions is called a mixed spike and slab prior for $\beta_k$, as one component of the mixture, $\upsilon_0\, \mathrm{IG}(a_\sigma, b_\sigma)$ (when $I_k = 0$), is a narrow spike around zero that strongly shrinks $\beta_k$ to zero, while the other component, $\mathrm{IG}(a_\sigma, b_\sigma)$ (when $I_k = 1$), is a wide slab that moves $\beta_k$ away from zero. The posterior distribution of $I_k$ determines which component of the mixture is predominant, contributing to the inclusion or exclusion of $\beta_k$. For categorical variables, we applied a peNMIG prior developed by Scheipl et al. [26], which allows blocks of coefficients to be included or excluded by improving "shrinkage" properties. Let $\beta_{kh}$ be the regression coefficient for the $h$th category of the $k$th predictor, then $\beta_{kh} = a_k \xi_{hk}$, where $a_k$ is assigned the NMIG prior described above and $\xi_{hk} \sim N(m_{hk}, 1)$. Here $m_{hk} = o_{hk} - (1 - o_{hk})$ and $o_{hk} \sim \mathrm{Bern}(0.5)$, which shrinks $|\xi_{hk}|$ towards 1. Hence, $a_k$ models the overall contribution of the $k$th predictor and $\xi_{hk}$ estimates the effects of each element $\beta_{kh}$ of the predictor [27]. In addition, we introduced another indicator $I_d$ for selection of either a categorical or a linear form of a continuous variable. Let $\beta_{kd1}$ and $\beta_{kd2}$ indicate coefficients of the categorical and linear form of the $k$th predictor, respectively, then $\beta_k = I_d \beta_{kd1} + (1 - I_d) \beta_{kd2}$, where $I_d \sim \mathrm{Bern}(0.5)$. MCMC simulation was employed to estimate the model parameters for variable selection in OpenBUGS version 3.0.2 (Imperial College and Medical Research Council; London, UK) [28]. Convergence was assessed by the Gelman and Rubin diagnostics [29], using the coda library in R [30]. In Bayesian variable selection, all models arising from any combination of covariates are fitted and the posterior probability for each model to be the true one is calculated. The predictors corresponding to the highest joint posterior probability of indicators $(I_1, I_2, \ldots, I_k, \ldots, I_K)$ were subsequently used as the best set of predictors to fit the final geostatistical model.
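To give a feel for the spike-and-slab prior described above, the following Python sketch draws regression coefficients from the prior with the stated hyperparameter values (5, 25), (1, 1) and 0.00025. It is a prior simulation only, not the OpenBUGS model used for variable selection. The inverse Gamma distribution is assumed here to be parameterised by shape and scale, which the text does not state, and the function name is hypothetical.

```python
import random


def draw_beta_prior(rng, a_sigma=5.0, b_sigma=25.0, v0=0.00025, a_pi=1.0, b_pi=1.0):
    """Draw (I_k, sigma_B^2, beta_k) from the NMIG spike-and-slab prior."""
    pi_k = rng.betavariate(a_pi, b_pi)       # prior inclusion probability
    i_k = 1 if rng.random() < pi_k else 0    # I_k ~ Bern(pi_k)
    # Inverse Gamma(shape a, scale b) draw: reciprocal of a Gamma variate
    # with shape a and rate b (gammavariate takes a scale, hence 1 / b).
    inv_gamma = 1.0 / rng.gammavariate(a_sigma, 1.0 / b_sigma)
    variance = inv_gamma if i_k == 1 else v0 * inv_gamma
    beta_k = rng.gauss(0.0, variance ** 0.5)
    return i_k, variance, beta_k


rng = random.Random(1)
draws = [draw_beta_prior(rng) for _ in range(20000)]
slab = [abs(b) for i, _, b in draws if i == 1]
spike = [abs(b) for i, _, b in draws if i == 0]
mean_abs_slab = sum(slab) / len(slab)
mean_abs_spike = sum(spike) / len(spike)
print(mean_abs_slab, mean_abs_spike)
```

Under these assumptions, coefficients drawn from the spike are on average about 60 times smaller in absolute value than those drawn from the slab (the square root of 1/0.00025), which is what makes the indicator act as an inclusion switch.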
A 5 × 5 km grid was overlaid on the P.R. China map, resulting in 363,377 pixels. Predictions for each soil-transmitted helminth species were obtained via INLA at the centroids of the grid's pixels. An overall soil-transmitted helminth prevalence was calculated assuming independence in the risk between any two species, that is, $p_S = p_A + p_T + p_h - p_A p_T - p_A p_h - p_T p_h + p_A p_T p_h$, where $p_S$, $p_A$, $p_T$ and $p_h$ indicate the predicted prevalence of overall soil-transmitted helminth, A. lumbricoides, T. trichiura and hookworm, respectively, for each pixel. The number of infected individuals at pixel level was estimated by multiplying the median of the corresponding posterior predictive distribution of the infection prevalence with the population density.
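Under the independence assumption, the probability of being infected with at least one species is one minus the probability of being free of all three. A minimal Python sketch of this calculation and of the pixel-level count follows. The prevalence values and the pixel population are illustrative, and the population figure is assumed to be the number of people in the pixel.

```python
def combined_prevalence(p_a, p_t, p_h):
    """Probability of infection with at least one species, assuming independence."""
    return 1.0 - (1.0 - p_a) * (1.0 - p_t) * (1.0 - p_h)


def infected_count(prevalence, population):
    """Expected number of infected people among `population` individuals."""
    return prevalence * population


# Illustrative pixel: prevalences of A. lumbricoides, T. trichiura and hookworm.
p_a, p_t, p_h = 0.068, 0.018, 0.037
p_s = combined_prevalence(p_a, p_t, p_h)
n_infected = infected_count(p_s, 10_000)  # hypothetical pixel population
print(round(p_s, 4), round(n_infected))
```

The product form used here is algebraically identical to the inclusion-exclusion sum of single, pairwise and triple terms.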

Model validation
Our model was fitted on a subset of the data, including approximately 80% of survey locations. Validation was performed on the remaining 20% by estimating the mean predictive error (ME) between the observed prevalence $\pi_i$ and the predicted prevalence $\hat{\pi}_i$ at location $i$, where $\mathrm{ME} = \frac{1}{N}\sum_{i=1}^{N}(\pi_i - \hat{\pi}_i)$ and $N$ is the total number of test locations. In addition, we calculated Bayesian credible intervals (BCI) of various probability coverages and the percentages of observations included in these intervals.
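The two validation summaries can be sketched in a few lines of Python. The observed prevalences, predictions and interval bounds below are made-up numbers for five hypothetical test locations, and the function names are illustrative.

```python
def mean_error(observed, predicted):
    """Mean predictive error: average of (observed - predicted)."""
    return sum(o - p for o, p in zip(observed, predicted)) / len(observed)


def coverage(observed, lower, upper):
    """Fraction of observations falling inside their credible interval."""
    inside = sum(1 for o, lo, hi in zip(observed, lower, upper) if lo <= o <= hi)
    return inside / len(observed)


# Hypothetical test locations (prevalences as proportions).
observed = [0.10, 0.02, 0.30, 0.00, 0.15]
predicted = [0.08, 0.03, 0.25, 0.01, 0.12]
lower95 = [0.03, 0.00, 0.12, 0.00, 0.05]
upper95 = [0.16, 0.08, 0.28, 0.04, 0.22]

me = mean_error(observed, predicted)
cov95 = coverage(observed, lower95, upper95)
print(round(me, 3), cov95)
```

With this sign convention a positive ME means that, on average, the predictions fall below the observed prevalence.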

Data summaries
The final dataset included 1,187 surveys for hookworm infection carried out at 1,067 unique locations; 1,157 surveys for A. lumbricoides infection at 1,052 unique locations; and 1,138 surveys for T. trichiura infection at 1,028 unique locations. The overall prevalence was 9.8%, 6.6% and 4.1% for A. lumbricoides, hookworm and T. trichiura infection, respectively. Details about the number of surveys by location type, study year, diagnostic method and infection prevalence are shown in Table 2. The geographical distribution of locations and observed prevalence for each soil-transmitted helminth species are shown in Figure 1. Maps of the spatial distribution of environmental/climatic, soil types and socioeconomic covariates used in Bayesian variable selection are provided in Additional file 1: Figure S1.

Spatial statistical modelling and variable selections
The models with the highest posterior probabilities selected the following covariates: GDP per capita, elevation, NDVI, LST at day, LST at night, precipitation, pH measured in water, and climatic zones for T. trichiura; GDP per capita, elevation, NDVI, LST at day, LST at night, precipitation, bulk density, gypsum content, organic carbon content, climatic zone and land cover for hookworm; and GDP per capita, elevation, NDVI, LST at day and climatic zone for A. lumbricoides. The corresponding posterior probabilities of the respective models were 33.2%, 23.6% and 21.4% for T. trichiura, hookworm and A. lumbricoides, respectively. The parameter estimates that arose from the Bayesian geostatistical logistic regression fit are shown in Tables 3, 4 and 5. The infection risk of all three soil-transmitted helminth species decreased considerably from 2005 onwards. We found significant positive association between NDVI and the prevalence of A. lumbricoides. A negative association was found between GDP per capita, arid or snow/polar climatic zones and the prevalence of A. lumbricoides. High precipitation and LST at night are favourable conditions for the presence of hookworm, while high NDVI, LST at day, urban or wet land covers and arid or snow/polar climatic zones are less favourable. Elevation, LST at night, NDVI larger than 0.45 and equatorial climatic zone were associated with a higher odds of T. trichiura infection, while LST at day, arid or snow climatic zones were associated with a lower odds of T. trichiura infection.

Model validation results
Model validation indicated that the 95% BCIs of the Bayesian geostatistical logistic regression models included the observed prevalence at 84.2%, 81.5% and 79.3% of the test locations for T. trichiura, hookworm and A. lumbricoides, respectively. A plot of coverage for the full range of credible intervals is presented in Additional file 2: Figure S2. The MEs for hookworm, A. lumbricoides and T. trichiura were 0.56%, 1.7% and 2.0%, respectively, suggesting that our model may slightly under-estimate the risk of each of the soil-transmitted helminth species.

Predictive risk maps of soil-transmitted helminth infections
The high prediction uncertainty shown in Figure 2B is correlated with high prevalence areas. High infection prevalence (>20%) with T. trichiura was predicted for a few small areas of the southern part of P.R. China. Moderate-to-high prevalence (5-20%) was predicted for large areas of Hainan province. High hookworm infection prevalence (>20%) was predicted for Hainan, eastern parts of Sichuan and southern parts of Yunnan provinces. Low prevalence (0.1-5%) of T. trichiura and hookworm infections was predicted for most areas of the southern part of P.R. China, while close to zero prevalence areas were predicted for the northern part.

Estimates of number of people infected
Figure 5 shows the combined soil-transmitted helminth prevalence and the number of infected individuals from 2005 onwards. Table 6 summarises the population-adjusted predicted prevalence and the number of infected individuals, stratified by province. The overall population-adjusted predicted prevalences of A. lumbricoides, hookworm and T. trichiura infections were 6.8%, 3.7% and 1.8%, respectively, corresponding to 85.4, 46.6 and 22.1 million infected individuals. The overall population-adjusted predicted prevalence for combined soil-transmitted helminth infections was 11.4%. For A. lumbricoides, the predicted prevalence ranged from 0.32% (Shanghai) to 27.9% (Guizhou province). Shanghai had the smallest (0.05 million) and Sichuan province the largest number (14.8 million) of infected individuals. For T. trichiura, the predicted prevalence ranged from 0.01% (Tianjin) to 18.3% (Hainan province). The smallest numbers of infected individuals were found in Nei Mongol, Ningxia Hui, Qinghai provinces and Tianjin (<0.01 million), whereas the largest number, 3.7 million, was predicted for Sichuan province. For hookworm, Ningxia Hui and Qinghai provinces had the lowest predicted prevalence (<0.01%), while Hainan province had the highest (22.1%). The provinces of Gansu, Nei Mongol, Ningxia Hui, Qinghai, Xinjiang Uygur and Tibet, and the cities of Beijing, Shanghai and Tianjin each had fewer than 10,000 individuals infected with hookworm. Sichuan province had the largest predicted number of hookworm infections (14.3 million).
The predicted combined soil-transmitted helminth prevalence ranged from 0.70% (Tianjin) to 40.8% (Hainan province). The number of individuals infected with soil-transmitted helminths ranged from 0.07 million (Tianjin) to 29.0 million (Sichuan province). Overall, slightly more than one out of ten people in P.R. China is infected with soil-transmitted helminths, corresponding to more than 140 million infections in the year 2010.

Discussion
To our knowledge, we present the first model-based, nation-wide predictive infection risk maps of soil-transmitted helminths for P.R. China. Previous epidemiological studies [7] were mainly descriptive, reporting prevalence estimates at specific locations or visualized at province level using interpolated risk surface maps. We carried out an extensive literature search and collected published georeferenced soil-transmitted helminth prevalence data across P.R. China, alongside the ones from the second national survey that had been completed in 2004. Bayesian geostatistical models were utilised to identify climatic/environmental and socioeconomic factors that were significantly associated with infection risk, and hence, the number of infected individuals could be calculated at high spatial resolution. We derived species-specific risk maps. Additionally, we produced a risk map with any soil-transmitted helminth infection, which is particularly important for the control of soil-transmitted helminthiasis, as the same drugs (mainly albendazole and mebendazole) are used against all three species [31,32].
Model validation suggested good predictive ability of our final models. In particular, 84.2%, 81.5% and 79.3% of survey locations were correctly predicted within a 95% BCI for T. trichiura, hookworm and A. lumbricoides, respectively. The combined soil-transmitted helminth prevalence (11.4%) is supported by the current surveillance data reported to China CDC that shows infection rates in many areas of P.R. China around 10%. We found that all ME were above zero, hence the predictive prevalence slightly under-estimated the true prevalence of each of the three soil-transmitted helminth species. The combined soil-transmitted helminth prevalence estimates assume that the infection of each species is independent of each other. However, previous research reported significant associations, particularly between A. lumbricoides and T. trichiura [33,34]. Hence, our assumption may over-estimate the true prevalence of soil-transmitted helminths. Unfortunately we do not have co-infection data from P.R. China, and thus we are unable to calculate a correction factor.
Our results indicate that several environmental and climatic predictors are significantly associated with soil-transmitted helminth infections. For example, LST at night was significantly associated with T. trichiura and hookworm, suggesting that temperature is an important driver of transmission. Similar results have been reported by other researchers [2,35]. Our results suggest that the risk of infection with any of the soil-transmitted helminth species is higher in equatorial or warm zones, compared to the arid and snow/polar zones. This is consistent with earlier findings that extremely arid environments limit the transmission of soil-transmitted helminths [2], while equatorial or warm zones provide temperatures and soil moisture that are particularly suitable for larval development [35]. However, we found a positive association between elevation and T. trichiura infection risk, which contradicts earlier reports [36,37]. The reason may be the altitude effect, i.e. the negative correlation between altitude and economy in P.R. China [38]. The low socioeconomic development in high altitude or mountainous areas might result in limited access to healthcare services [39,40].
On the other hand, it is reported that socioeconomic factors are closely related to the behaviour of people, which in turn impacts the transmission of soil-transmitted helminths [41]. Indeed, wealth, inadequate sewage discharge, drinking of unsafe water, lack of sanitary infrastructure, personal hygiene habits, recent travel history, low education and demographic factors are strongly associated with soil-transmitted helminth infections [42-46]. Our results show that GDP per capita has a negative effect on A. lumbricoides infection risk.
Other socioeconomic proxies such as sanitation level, number of hospital beds and percentage of people with access to tap water might be more readily able to explain the spatial distribution of infection risk.
Model-based estimates adjusted for population density indicate that the highest prevalence of A. lumbricoides occurred in Guizhou province. T. trichiura and hookworm were most prevalent in Hainan province. Although the overall soil-transmitted helminth infection risk decreased over the past several years, Hainan province had the highest risk in 2010, followed by Guizhou and Sichuan provinces. These results are consistent with the reported data of the second national survey on important parasitic diseases [7], and hence more effective control strategies are needed in these provinces.
The targets set out by the Chinese Ministry of Health in the "National Control Program on Important Parasitic Diseases from 2006 to 2015" are to reduce the prevalence of soil-transmitted helminth infections by 40% until 2010 and up to 70% until 2015 [8]. The government aims to reach these targets by a series of control strategies, including anthelminthic treatment, improvement of sanitation, and better information, education and communication (IEC) campaigns [47]. Preventive chemotherapy is recommended for populations older than 3 years in areas where the prevalence of soil-transmitted helminth infection exceeds 50%, while targeted drug treatment is recommended for children and rural population in areas where infection prevalences range between 10% and 50% [48]. Our models indicate that the first step of the target, i.e. reduction of prevalence by 40% until 2010, has been achieved. Indeed, the prevalence of T. trichiura, hookworm and A. lumbricoides dropped from 4.6%, 6.1% and 12.7% in the second national survey between 2001 and 2004 [7] to 1.8%, 3.7% and 6.8% in 2010, which corresponds to respective reductions of 60.9%, 39.3% and 46.5%. The combined soil-transmitted helminth prevalence dropped from 19.6% to 11.4% in 2010, a reduction of 41.8%. These results also suggest that, compared to T. trichiura and A. lumbricoides, more effective strategies need to be tailored for hookworm infections.
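The relative reductions quoted above follow directly from the survey and model-based prevalences, as the short Python check below illustrates. The dictionary and function names are illustrative; the prevalence values are those given in the text.

```python
def relative_reduction(before, after):
    """Relative reduction from `before` to `after`, in percent."""
    return (before - after) / before * 100.0


# Prevalence (%) in the second national survey and predicted for 2010.
survey_2001_2004 = {"T. trichiura": 4.6, "hookworm": 6.1,
                    "A. lumbricoides": 12.7, "combined": 19.6}
predicted_2010 = {"T. trichiura": 1.8, "hookworm": 3.7,
                  "A. lumbricoides": 6.8, "combined": 11.4}

reductions = {
    species: round(relative_reduction(survey_2001_2004[species],
                                      predicted_2010[species]), 1)
    for species in survey_2001_2004
}
print(reductions)
```

Hookworm shows the smallest relative reduction of the three species, sitting marginally below the 40% mark, which is in line with the call for more tailored strategies against hookworm.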
The data of our study stem largely from community-based surveys. However, the information extracted from the literature is not disaggregated by age, and hence we were not able to obtain age-adjusted predictive risk maps. In addition, more than 96% of observed surveys used the Kato-Katz technique [49,50]. We assumed that the diagnostic sensitivity was similar across survey locations. However, the sensitivity depends on the intensity of infection, and hence varies in space [51]. The above data limitations are known in geostatistical meta-analyses of historical data [27] and we are currently developing methods to address them.

Conclusion
The work presented here is the first major effort to present model-based estimates of the geographical distribution of soil-transmitted helminth infection risk across P.R. China, and to identify the associated climatic, environmental and socioeconomic risk factors. Our prediction maps provide useful information for identifying priority areas where interventions targeting soil-transmitted helminthiasis are most urgently required. In a next step, we plan to further develop our models to address data characteristics and improve model-based predictions.

Additional files
Additional file 1: Figure S1. Spatial distribution of environmental/climatic, soil types and socioeconomic factors in P.R. China.
Additional file 2: Figure S2. Model validation results. Percentage of survey locations with observed prevalence included within the Bayesian credible interval (BCI) of various probability coverage cut-offs (bar plots) calculated from the posterior predicted distribution. Solid lines indicate the corresponding width of BCI.