Identifying biotic interactions which drive the spatial distribution of a mosquito community

Background Spatial variation in the risk of many mosquito-borne pathogens is strongly influenced by the distribution of communities of suitable vector mosquitoes. The spatial distributions of such communities have been linked to the abiotic habitat requirements of each constituent mosquito species, but the biotic interactions between mosquitoes and other species are less well understood. Determining which fauna restrict the presence and abundance of key mosquito species in vector communities may identify species which could be employed as natural biological control agents. Whilst biotic interactions have been studied in the laboratory, a lack of appropriate statistical methods has prohibited the identification of key interactions which influence mosquito distributions in the field. Joint species distribution models (JSDMs) have recently been developed to identify biotic interactions influencing the distributions of species from empirical data. Methods We apply a JSDM to field data on the spatial distribution of mosquitoes in a UK wetland to identify both abiotic factors and biotic interactions driving the composition of the community. Results As expected, mosquito larval distributions in this wetland habitat are strongly driven by environmental covariates including water depth, temperature and oxidation-reduction potential. By factoring out these environmental variables, we are able to identify species (ditch shrimp of the genus Palaemonetes and fish) as predators which appear to restrict mosquito distributions. Conclusions JSDMs offer vector ecologists a way to identify potentially important biotic interactions influencing the distributions of disease vectors from widely available field data. This information is crucial to understand the likely effects of habitat management for vector control and to identify species with the potential for use in biological control programmes. We provide an R package BayesComm to enable the wider application of these models. Electronic supplementary material The online version of this article (doi:10.1186/s13071-015-0915-1) contains supplementary material, which is available to authorized users.

Section A -Statistical model and inference Statistical model The joint species distribution models (JSDM) we apply here (and which is implemented in the R package we provide BayesComm) is specified as a multivariate binomial regression model. We use a multivariate extension of the latent variable model for binary regression (??). Our approach is very similar to a model recently described for analysis of ecological communities (?) except that we draw the latent variable from a normal, rather than a logistic, distribution. This is equivalent to using a probit function rather than a logit function as the canonical link in a univariate binomial regression. The probit and logit functions are very similar, with the only apparent drawback of the probit model being that regression coefficients for binary covariates can no longer be interpreted directly as log odds ratios (which are not widely used in ecology). The advantage of our approach is that it enables use of a Gibbs sampler to make inference about the regression parameters, thereby reducing computation time and the risk of numerical errors. The model is defined as: where y i,j is a binomial variable representing presence (1) or absence (0) of species j at site i, z is a normally distributed latent variable, 1(z > 0) is an indicator function returning 1 when z > 0 and 0 otherwise, X j is an n by k j design matrix for species j, β j is a vector of k j regression coefficients for species j and N (0, R) is an m-dimensional standard multivariate normal distribution with mean vector 0 and symmetric, positivedefinite correlation matrix R, n is the number of sites, m the number of species and k j the number of environmental covariates used to model the fundamental niche of species j. The elements of R describe whether species co-occur more or less often than would be expected by their fundamental niches alone and is indicative of the underlying network of interactions between species in the community.

Model inference
We use a computationally efficient Gibbs algorithm, based on a sampler described by ?, to sample the model parameters in turn from their conditional posterior distributions. At each iteration this entails the following steps: 1. Sample the latent variables z from a truncated multivariate standard normal distribution: such that z ij is positive when y ij = 1 and negative otherwise.
2. Sample the vector of regression coefficients β j for each species j from a multivariate normal distribution: where σ is the standard deviation of the prior distribution over β j , I is an identity matrix, having diagonal elements 1 and all other elements 0 and denotes the matrix transpose.
3. Sample the correlation matrix R by first sampling a covariance matrix W and scaling this to a correlation matrix: where diag(.) denotes the diagonal vector of a matrix and W m (., .) is a Wishart distribution of dimension m (the number of species). The scale matrix S and degrees of freedom parameter ν define the prior distribution over W and therefore R.
By sampling a covariance matrix and scaling to a correlation matrix, we are able to use a Gibbs sampler whilst avoiding the non-identifiability of the variance of a model for binary data. The implications of this approach are discussed in ?. Our choices of prior parameters σ, S and ν are discussed in Section 2

Section B -Choice of priors
To construct a sampler for Bayesian statistical models it is necessary to specify priors over all of the parameters for which we want to make inference. These express our prior belief (before observing any data) about the probability distribution of the parameters. Here we use conjugate priors to enable the efficient Gibbs sampler described above.
For each element of each vector of regression coefficients β j we use a diffuse normal prior with mean 0 and variance 100 by setting σ to 10. This is a widely used prior which exhibits little influence on the posterior. Specification of an appropriate prior for the unidentified covariance matrix W is less straightforward. A commonly used conjugate prior for W is obtained by setting: This prior has the feature that each element of a correlation matrix derived from W has a marginally uniform distribution and it therefore has no impact on the posterior. Such a prior is problematic for our model for two reasons. Firstly, a uniform prior implies that it is equally likely for the distributions of two species to be very strongly correlated (i.e. always found together or never found together) as it is for there to be no correlation between them. This is biologically unrealistic; we would expect the majority of inter-species interactions to be weak or non-existent with relatively few interactions driving moderate correlations in distributions (?). Secondly, a prior of this sort exhibits a dependency between the unobserved variance parameters of W and the correlation coefficients of R, such that when these variance parameters are large, the prior assigns much higher probability to strong correlations than weak correlations. This leads to very unrealistic posterior parameter estimates with a bimodal distribution close to 1 and -1. ? demonstrate a weakly informative prior which avoids this problem but maintains conjugacy for the likelihood: where n is the number of observations. With few observations this gives a reasonable non-uniform prior. As the number of observations increases the prior becomes centred on 0, but with the increased amount of data its influence on the posterior becomes weaker. The shape of this prior and its weak, but informative, effect on the posterior are illustrated using simulated data in Fig. 2.1 and R code to reproduce these figures is given in the supplementary material.

Section C -Parameter estimates
The posterior distributions of the regression coefficients from the environment-only and full model are summarised in Fig. 3.1. Summaries of the posterior distributions of the correlation coefficients and the percentage of deviance explained per species from the communityonly and the full model are shown in Fig. 3