Mitochondrial genomes of two phlebotomine sand flies, Phlebotomus chinensis and Phlebotomus papatasi (Diptera: Nematocera), the first representatives from the family Psychodidae

Leishmaniasis is a worldwide but neglected disease of humans and animals transmitted by sand flies, vectors that also transmit other important diseases. Mitochondrial genomes contain abundant information for population genetic and phylogenetic studies, which are important in disease management. However, the available mitochondrial sequences of these crucial vectors are limited, emphasizing the need to develop more mitochondrial genetic markers. The complete mitochondrial genome of Phlebotomus chinensis was amplified in eight fragments and sequenced using primer walking. The mitochondrial genome of Phlebotomus papatasi was reconstructed from whole-genome sequencing data available in GenBank. The phylogenetic relationships of 24 selected representatives of Diptera were inferred from codon positions 1 and 2 of 13 protein-coding genes, using Bayesian inference (BI) and maximum likelihood (ML) methods. We provide the first Phlebotomus (P. chinensis and P. papatasi) mitochondrial genomes. Both genomes contain 13 protein-coding genes, 22 transfer RNA genes, two ribosomal RNA genes, and an A + T-rich region. The gene order of the Phlebotomus mitochondrial genomes is identical to the ancestral gene order of insects. Phylogenetic analyses demonstrated that Psychodidae and Tanyderidae are sister taxa. Potential markers for population genetic studies of Phlebotomus species were also revealed. The generated mitochondrial genomes of P. chinensis and P. papatasi represent a useful resource for comparative genomic studies and provide valuable future markers for the population genetic study of these important Leishmania vectors. Our results also provide a preliminary demonstration of the phylogenetic placement of Psychodidae based on mitochondrial genomes.


Background
Phlebotomine sand flies are small insects in the family Psychodidae and are important vectors of human pathogens, including protozoan parasites, bacteria, and viruses [1], making these insects a global public health concern. Leishmaniasis, which is transmitted by phlebotomine sand flies, is one of the world's most neglected diseases, causing significant mortality and morbidity in more than 80 countries of both the Old and New World. The majority of Old World vector species belong to the genus Phlebotomus (42 vector species), while the New World is dominated by the genus Lutzomyia (56 vector species) [2]. Of the Phlebotomus species, two are of particular interest: Phlebotomus chinensis and Phlebotomus papatasi. Phlebotomus chinensis, the main vector of the mountainous sub-type of zoonotic visceral leishmaniasis, has a wide geographical distribution extending from the Yangtze River to northeast China [3][4][5]. In recent years, the numbers of visceral leishmaniasis (VL) cases and endemic foci in China have increased (by 54.37 % and 41.86 %, respectively) compared with the 1990s. To date, six provinces/autonomous regions still report autochthonous cases. The mountainous sub-type of zoonotic VL occurs in four provinces, which account for almost half of the total cases [6][7][8]. Prevention and control of the vector P. chinensis is important to reduce the public health threat of VL in endemic regions. Phlebotomus papatasi is the vector of sand fly fever and zoonotic cutaneous leishmaniasis in the Middle East and Mediterranean regions and is also an important model organism used to study sand fly-host-parasite interactions [9][10][11][12].
In recent years, the mitochondrial genome has become increasingly important in phylogenetic analysis, biological identification and population studies, due to its rapid evolutionary rate, low recombination and maternal inheritance [13,14]. Although microsatellites and individual gene sequences, such as Cytb and ND4, have been used for sand fly studies in the past [15][16][17], the mitochondrial genome of phlebotomine sand flies has gone largely unstudied, which is surprising given their importance as vectors of pathogens. The complete mitochondrial genome contains important information that is not available from individual genes, including genome-level characteristics for phylogenetic reconstruction. Additionally, because its genes evolve at different rates, the mitochondrial genome can provide molecular markers for studying phylogenetic relationships at different taxonomic levels, including intraspecific population structure.
Despite these benefits, information on the mitochondrial genomes of Diptera is still limited, especially for representatives of Nematocera. Most of these genomes have been sequenced by long PCR with primer walking. With the widespread application of next-generation sequencing (NGS), long PCR combined with NGS and direct shotgun sequencing have also been used to determine mitochondrial genomes [18,19]. Although Sanger sequencing remains indispensable, NGS is relatively fast and inexpensive, especially direct shotgun sequencing. In fact, reconstructing mitochondrial genomes from shotgun data has become one of the simplest approaches. In the present study, we determined the complete mitochondrial genomes of two important Leishmania vectors, P. chinensis and P. papatasi, by long PCR with primer walking and by reconstruction from direct shotgun sequencing data, respectively; we report their genome features and analyze the overall phylogenetic status of Psychodidae within Diptera. The addition of new mitochondrial genomes from nematoceran species is of critical importance for understanding the evolution of the nematoceran mitochondrial genome and for examining phylogeny within Nematocera and Diptera.

Specimen collection and DNA extraction
Specimens of P. chinensis were collected from Wen County (104.25°E, 33.18°N), Gansu province, China. All specimens were preserved in 95 % ethanol and stored at −20°C until DNA extraction. DNA was extracted from a single adult P. chinensis using the TIANamp Micro DNA Kit (Tiangen Biotech, Beijing, China) according to the manufacturer's protocol.

Mitochondrial genome determination
The complete mitochondrial genome of P. chinensis was amplified in eight overlapping PCR fragments from a single adult. First, six fragments were amplified using previously published primers (Table 1). Then, from the generated sequences, two specific primers were designed to amplify overlapping fragments spanning the whole mitochondrial genome. Short fragments (<2 kb) were amplified using TaKaRa rTaq (not proof-reading; Takara Biotech, Dalian, China; http://www.takara.com.cn) with the following cycling conditions: an initial denaturation for 1 min at 93°C, followed by 35 cycles of 10 s at 92°C, 1.5 min at 48-57°C, and 1-2 min at 72°C, and a final extension of 6 min at 72°C. Long fragments (>2 kb) were amplified using TaKaRa LA Taq (proof-reading; Takara Biotech, Dalian, China; http://www.takara.com.cn) under the following cycling conditions: an initial denaturation for 1 min at 94°C, followed by 40 cycles of 20 s at 93°C, 30 s at 48-54°C, and 3-6 min at 68°C, and a final extension of 10 min at 68°C. After purification with a PCR Purification Kit (Sangon Biotech, Shanghai, China), all PCR products were sequenced directly with the PCR primers and with internal primers generated by primer walking. The complete mitochondrial genome of P. papatasi was reconstructed from 454 sequencing data publicly available in the Sequence Read Archive (SRA) of GenBank (Accession number: SRX027115). Reconstruction followed the baiting and iterative mapping approach [20], using MITObim v1.7 with default parameters [21,22] and the mitochondrial genome of P. chinensis as the reference sequence.

Sequence analyses
Contiguous sequence fragments were assembled using Staden Package v1.7.0 [23]. Protein-coding genes (PCGs) and ribosomal RNA (rRNA) genes were identified by aligning them with homologous regions of other dipteran insects using Clustal X [24]. Transfer RNA (tRNA) genes and their potential cloverleaf structures were identified by tRNAscan-SE 1.21 [25]. The secondary structures of the two rRNA genes were determined mainly by comparison with the published rRNA secondary structures of Drosophila melanogaster and Drosophila virilis [26]. Tandem Repeats Finder v4.07 was used to identify tandem repeats in the A + T-rich region [27]. The base composition and codon usage were calculated with MEGA 5.1 [28]. AT and GC skew were calculated according to the formulae: AT skew = (fA − fT) / (fA + fT) and GC skew = (fG − fC) / (fG + fC), where fA, fT, fG and fC are the frequencies of the respective bases. Sliding window analyses were performed using DnaSP v5 [29]. A sliding window of 500 bp (with a step size of 25 bp) was used to estimate nucleotide diversity Pi (π) across the alignment of the P. chinensis, P. papatasi and Lutzomyia umbratilis [30] mitochondrial genomes, excluding the A + T-rich region.
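The skew formulae above can be illustrated with a short Python sketch. It is only a minimal illustration and not the MEGA 5.1 calculation used in the study; the function name `skew_stats` and the toy sequence are hypothetical, and ambiguous bases are simply ignored.

```python
from collections import Counter


def skew_stats(seq):
    """Return (A+T content, AT skew, GC skew) for a nucleotide sequence.

    Only unambiguous A, C, G and T are counted; this is an assumption of
    the sketch, not a statement about how MEGA treats ambiguity codes.
    """
    counts = Counter(seq.upper())
    a, t, g, c = (counts[base] for base in "ATGC")
    total = a + t + g + c
    if total == 0:
        raise ValueError("sequence contains no A, C, G or T")
    at_content = (a + t) / total
    # AT skew = (fA - fT) / (fA + fT); GC skew = (fG - fC) / (fG + fC)
    at_skew = (a - t) / (a + t) if (a + t) else 0.0
    gc_skew = (g - c) / (g + c) if (g + c) else 0.0
    return at_content, at_skew, gc_skew


# Toy sequence (hypothetical): A=3, T=2, G=3, C=1
at_content, at_skew, gc_skew = skew_stats("AAATTGGGC")
print(round(at_content, 3), at_skew, gc_skew)
```

A positive AT skew means A is more frequent than T on the strand analysed, and a negative GC skew means C is more frequent than G.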

Phylogenetic analyses
For the phylogenetic analyses, a total of 24 representative species from Diptera were used to build the alignment (Table 2), with Bittacus pilicornis (Mecoptera) used as the outgroup. All 13 PCGs were extracted and translated (excluding the stop codon) using the invertebrate mitochondrial genetic code. The inferred amino acid sequences were aligned with Clustal X. The amino acid alignments were then transferred back to the corresponding DNA sequences, and third codon positions were removed. The best-fit model (GTR + Γ + I) was selected by the Akaike information criterion in jModelTest [31]. RAxML ver. 7.2.8 [33] and MrBayes ver. 3.1.2 [32] were used to construct the maximum likelihood (ML) and Bayesian inference (BI) phylogenies, respectively. For ML analyses, bootstrap analysis was performed with 1,000 replicates. For BI analyses, two sets of four chains were run simultaneously for 1,000,000 generations. Each set was sampled every 100 generations with a burn-in of 25 %. Stationarity was considered to be reached when the average standard deviation of split frequencies was less than 0.01.
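The two alignment steps described above (threading the DNA onto the amino acid alignment, then removing third codon positions) can be sketched in Python as follows. This is a hedged illustration only: the study does not state which tool performed these steps, and the function names `thread_codons` and `drop_third_positions` and the toy sequences are hypothetical.

```python
def thread_codons(aligned_protein, cds):
    """Map an unaligned, in-frame CDS onto its aligned protein sequence.

    Each residue consumes one codon; each gap ('-') becomes '---'.
    """
    if len(cds) % 3 != 0:
        raise ValueError("CDS length must be a multiple of 3")
    codons = [cds[i:i + 3] for i in range(0, len(cds), 3)]
    n_residues = sum(1 for aa in aligned_protein if aa != "-")
    if n_residues != len(codons):
        raise ValueError("residue count does not match codon count")
    codon_iter = iter(codons)
    return "".join("---" if aa == "-" else next(codon_iter)
                   for aa in aligned_protein)


def drop_third_positions(codon_alignment):
    """Keep codon positions 1 and 2 of a codon-aligned sequence."""
    if len(codon_alignment) % 3 != 0:
        raise ValueError("alignment length must be a multiple of 3")
    return "".join(codon_alignment[i:i + 2]
                   for i in range(0, len(codon_alignment), 3))


# Toy example (hypothetical): protein 'M-K' aligned against CDS 'ATGAAA'
codon_aln = thread_codons("M-K", "ATGAAA")
print(codon_aln)                        # ATG---AAA
print(drop_third_positions(codon_aln))  # AT--AA
```

Aligning at the amino acid level keeps codons intact, so positions 1, 2 and 3 stay in register across taxa before the third positions are discarded.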

Genome organization and composition
The circular mitochondrial genome of P. chinensis (GenBank accession number KR349297) is 16,277 bp in size. The complete mitochondrial genome of P. papatasi (GenBank accession number KR349298), 15,557 bp, was assembled from a total of 5,579 reads identified as being of mitochondrial origin. The estimated average per-base coverage of the reconstructed mitochondrial genome of P. papatasi is ~209×, based on the mean read length. The difference in mitochondrial genome size stems mainly from the varying length of the A + T-rich region, caused by variability in the number of tandem repeats. Consistent with published dipteran mitochondrial genomes, both Phlebotomus mitochondrial genomes contain 13 protein-coding genes (PCGs), 22 transfer RNA (tRNA) genes, two ribosomal RNA (rRNA) genes, and an A + T-rich region (Table 3). The majority-coding strand (J-strand) and the minority-coding strand (N-strand) encode 23 and 14 genes, respectively (Fig. 1). All 37 genes share the same arrangement as the hypothesized ancestral pancrustacean mitochondrial genome. The base composition of the Phlebotomus mitochondrial genomes is biased toward A + T, with a total A + T content (J-strand) of 79.2 % and 77.5 % for P. chinensis and P. papatasi, respectively. We calculated the A + T content and the AT- and GC-skew of the PCGs, RNAs and control region of three sand flies (Table 4), and found that these regions also possess a high A + T content; in particular, the A + T content of the third codon positions of PCGs and of the control region is distinctly higher than that of other regions.
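The coverage estimate above follows the usual relation coverage ≈ (number of reads × mean read length) / genome length. The mean read length itself is not reported here, so the Python sketch below back-calculates the value implied by the reported figures; that derived number (about 583 bp) is an inference from the rounded ~209× and should be read as approximate. The function names are hypothetical.

```python
def mean_coverage(n_reads, mean_read_len, genome_len):
    """Average per-base coverage from read count and mean read length."""
    return n_reads * mean_read_len / genome_len


def implied_read_length(n_reads, coverage, genome_len):
    """Mean read length implied by a reported coverage (inverse relation)."""
    return coverage * genome_len / n_reads


# Figures reported in the text for P. papatasi
N_READS = 5579        # mitochondrial reads
GENOME_LEN = 15557    # bp
COVERAGE = 209        # approximate, as reported

read_len = implied_read_length(N_READS, COVERAGE, GENOME_LEN)
print(round(read_len))  # 583 (derived, not a reported value)
print(round(mean_coverage(N_READS, read_len, GENOME_LEN)))  # 209
```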

Protein-coding genes and codon usage
All the protein-coding genes of P. chinensis start with a typical ATN codon except for COI (Table 3). Compared with P. chinensis, only ND2 and ND3 have different start codons in P. papatasi. COI in both P. chinensis and P. papatasi starts with the uncommon codon TCG, which has also been reported for COI in some nematoceran mitochondrial genomes [34][35][36]. The conventional stop codons TAA or TAG are used in all the PCGs of P. chinensis, while ND4 of P. papatasi terminates with the incomplete stop codon T. The conserved 7-bp overlap (ATGATAR) between ATP8 and ATP6, present in all known nematoceran mitochondrial genomes, was found in Phlebotomus. However, the typical nematoceran 7-bp overlap between ND4 and ND4L was not observed in the mitochondrial genomes of phlebotomine sand flies; instead, these two genes overlap by one nucleotide. The codon usage patterns of P. chinensis, P. papatasi, and L. umbratilis were summarized and the relative syn

Transfer and Ribosomal RNAs
All typical tRNA genes of metazoan mitochondrial genomes were identified in both Phlebotomus mitochondrial genomes studied. All 22 tRNAs of P. chinensis, P. papatasi, and L. umbratilis have the common cloverleaf secondary structure, although the DHU arm of trnS(AGN) is short, with only one complementary base pair. All anticodon usage is identical to that described for other nematoceran mitochondrial genomes, except for trnS(AGN) of L. umbratilis, which uses TCT instead of the common GCT. With regard to codon usage, the RSCU of codon AGA (the codon corresponding to the anticodon of trnS(AGN)) is far higher than those of the other three synonymous codons in L. umbratilis. The frequency of AGA is moderately high in P. chinensis and P. papatasi; however, the codon (AGC) corresponding to the anticodon (GCT) of their trnS(AGN) is rarely used. The most conserved tRNAs among P. chinensis, P. papatasi, and L. umbratilis are trnL(UUR), trnL(CUN), trnS(UCN) and trnI, whereas trnA, trnR and trnC show a low proportion of identical nucleotides. The inferred secondary structure models of the small ribosomal subunit (rrnS) and large ribosomal subunit (rrnL) RNAs of P. chinensis are shown in Figs. 3 and 4, respectively. The secondary structures of rrnS and rrnL contain three and six domains, respectively. Domain III of rrnL is absent, as has been reported for the secondary structure of other arthropod rrnL [26,37]. The overall structures of the P. chinensis rRNAs resemble those of other insects. Comparative analyses of the secondary structures of P. chinensis, P. papatasi, and L. umbratilis reveal an uneven distribution of conserved nucleotides: domains I and III of rrnS are more conserved than domain II, and domains I, II, and VI of rrnL have more variable sites. Variable positions of rrnS are largely restricted to H47, H673, H1305 and the regions between H577 and H673 and between H567 and H769. Domains IV and V of rrnL contain mainly conserved helices.
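The RSCU comparison of the serine AGN codons discussed above can be made concrete with a small Python sketch. RSCU for a codon is its observed count multiplied by the number of codons in its synonymous family and divided by the total count for that family. The sketch is not the MEGA 5.1 procedure; in particular, treating the four AGN codons as their own family (rather than pooling them with the UCN serine codons) is an assumption left to the caller, and the function name `rscu` and the toy sequence are hypothetical.

```python
from collections import Counter


def rscu(cds, family):
    """Relative synonymous codon usage within one codon family.

    cds    : in-frame coding sequence (trailing partial codon is ignored)
    family : iterable of synonymous codons, e.g. the four AGN codons
    """
    family = [codon.upper() for codon in family]
    usable = len(cds) - len(cds) % 3
    codons = [cds[i:i + 3].upper() for i in range(0, usable, 3)]
    counts = Counter(codon for codon in codons if codon in family)
    total = sum(counts.values())
    if total == 0:
        return {codon: 0.0 for codon in family}
    # RSCU = observed count * family size / total count in the family
    return {codon: counts[codon] * len(family) / total for codon in family}


# Toy CDS (hypothetical): AGA x3, AGT x1, plus one unrelated codon
values = rscu("AGAAGAAGAAGTATG", ["AGA", "AGG", "AGC", "AGT"])
print(values)  # {'AGA': 3.0, 'AGG': 0.0, 'AGC': 0.0, 'AGT': 1.0}
```

An RSCU above 1 indicates a codon used more often than expected under equal usage within its family, which is the sense in which AGA is described as strongly preferred in L. umbratilis.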

The A + T-rich region
The A + T-rich regions of P. chinensis and P. papatasi are 1,433 bp and 723 bp long, respectively, and have a high A + T content (91.1 % for P. chinensis and 92.3 % for P. papatasi). The A + T-rich region of P. chinensis contains seven identical tandem repeat units of a 159-bp sequence and one shortened repeat unit of only 79 bp. In P. papatasi, there are three tandem repeat units: the first two (162 bp) are nearly identical, with one substitution at the 159th position, while the third is a shortened repeat unit (89 bp). All the tandem repeat sequences of P. chinensis and P. papatasi begin in the rrnS gene, whereas the tandem repeat sequences of L. umbratilis (repeat unit of 372 bp) are located in the central part of the A + T-rich region. Additionally, the alignment of the tandem repeat units of P. chinensis and P. papatasi shows 60.2 % similarity, but there is no evidence for homologous repeat motifs between the Phlebotomus species and L. umbratilis. Abundant microsatellite-like elements occur throughout the region between the tandem repeat sequence and trnI (e.g. (AT)3, (AT)5, (AT)6, (AT)8, (TA)4, and (TA)6 in P. papatasi). These tandem repeat units and microsatellite-like elements are potentially useful markers for the study of geographical population structure [38]. Accurately estimating the length and number of repeats and assembling the A + T-rich region are often difficult, particularly when the region includes various complex repeats. To obtain an accurate A + T-rich region for P. chinensis, we relied on paired-end Sanger sequencing, which can span the repeat region (approximately 1.2 kb), and used agarose gel electrophoresis of the amplified control region to confirm the size of the repeat region and the number of repeats. In the control region of P. papatasi, we reconstructed an architecture similar to that of P. chinensis. The high coverage and comparatively long read length also support the accuracy of this sequence.
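The repeat figures above can be cross-checked with simple arithmetic, sketched here in Python. All numbers are taken from the text; the function name `repeat_array_length` is hypothetical, and the comparison with the genome size difference is only a consistency check.

```python
def repeat_array_length(units):
    """Total length of a tandem repeat array.

    units: list of (copy_number, unit_length_bp) tuples.
    """
    return sum(copies * length for copies, length in units)


# Repeat units reported in the text
chinensis_array = repeat_array_length([(7, 159), (1, 79)])  # 7 full + 1 short
papatasi_array = repeat_array_length([(2, 162), (1, 89)])   # 2 full + 1 short

print(chinensis_array)  # 1192 bp, i.e. approximately 1.2 kb
print(papatasi_array)   # 413 bp

# Length difference of the A + T-rich regions vs. the whole genomes
control_region_diff = 1433 - 723   # 710 bp
genome_diff = 16277 - 15557        # 720 bp
print(control_region_diff, genome_diff)
```

The 1,192 bp array in P. chinensis matches the approximately 1.2 kb repeat region mentioned above, and the 710 bp difference between the two A + T-rich regions accounts for nearly all of the 720 bp difference in genome size.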

Nucleotide diversity of mitochondrial genome among Phlebotomus chinensis, P. papatasi and Lutzomyia umbratilis
A sliding window analysis was performed to estimate nucleotide diversity Pi (π) across the mitochondrial genomes of P. chinensis, P. papatasi and L. umbratilis, excluding the A + T-rich region (Fig. 5). The analysis indicated that the most variable coding regions lie within the ND5 gene, suggesting that these regions are evolving rapidly under few selective constraints and can be used as effective markers to investigate population structure and, potentially, to resolve the phylogenetic relationships of closely related species. As expected, the overall sequence variability of the rRNA regions is lower than that of other regions. The most conserved fragments were found in the rrnL region. Among the PCGs, COI and ND1 were the most conserved. By contrast, ND6, ATP8 and ND3 displayed high variability.
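The sliding window estimate of π can be sketched in Python as below. This is a simplified illustration, not the DnaSP v5 implementation: π is taken as the mean proportion of differing sites over all sequence pairs, alignment columns containing a gap are excluded, and the function names and toy alignment are hypothetical. The default window and step follow the 500 bp and 25 bp values given in the methods.

```python
from itertools import combinations


def window_pi(seqs, start, end):
    """Mean pairwise differences per site in columns [start, end)."""
    if len(seqs) < 2:
        raise ValueError("need at least two aligned sequences")
    # Keep only columns without gaps (an assumption of this sketch)
    cols = [col for col in zip(*(s[start:end] for s in seqs))
            if "-" not in col]
    if not cols:
        return 0.0
    pairs = list(combinations(range(len(seqs)), 2))
    diffs = sum(col[i] != col[j] for col in cols for i, j in pairs)
    return diffs / (len(pairs) * len(cols))


def sliding_pi(seqs, window=500, step=25):
    """Return (window midpoint, pi) tuples along an alignment."""
    length = len(seqs[0])
    if any(len(s) != length for s in seqs):
        raise ValueError("sequences must be aligned to equal length")
    return [(start + window // 2, window_pi(seqs, start, start + window))
            for start in range(0, length - window + 1, step)]


# Toy alignment (hypothetical) with a small window for illustration
toy = ["AAAA", "AAAT", "AATT"]
for midpoint, pi in sliding_pi(toy, window=2, step=1):
    print(midpoint, round(pi, 3))
```

Plotting π against the window midpoint gives a profile of the kind shown in Fig. 5, with peaks marking the most variable regions.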

Phylogenetic analyses
Diptera is a megadiverse group of extant insects. Historically, Diptera was divided into two suborders, Nematocera and Brachycera. Brachycera has been confirmed as a monophyletic group by robust phylogenetic analyses, whereas Nematocera is generally accepted to be paraphyletic, with Brachycera derived from within these lineages. The mitochondrial genome contains much information and has been used to resolve the phylogenetic relationships of Diptera, especially those of Brachycera [39][40][41][42]. In the present study, the phylogenetic relationships inferred from ML and BI analyses using only the first and second codon positions of the 13 PCGs share similar topologies (Fig. 6). Consistent with previous results, Brachycera formed a monophyletic group and clustered with Bibionomorpha as its sister group [43,44]. Surprisingly, the Psychodidae species clustered with Protoplasa fitchii, the lone representative of Tanyderidae, with high support (bootstrap value of 98 % in ML analyses and Bayesian posterior probability (Bpp) of 1 in BI analyses). This is the first time this relationship has been recovered from mitochondrial data, and it agrees with the results of other molecular datasets [43,44]. This clade was derived from Culicomorpha, but the node was weakly supported (<50 % for bootstrap value and 0.7 for Bpp), suggesting that the relationship between this branch and Culicomorpha is still ambiguous. However, the close relationship within this large clade was confirmed by moderate node support (72 % for bootstrap value and 0.99 for Bpp), which is in accordance with previous studies using multiple markers [43]. The traditional basal branch comprising Tipulidae and Trichoceridae (Tipulomorpha) was not recovered as a monophyletic clade; instead, Tipulidae was an early split in the phylogeny of Diptera, while the families Ptychopteridae and Trichoceridae formed a branch that was the sister group of all remaining groups.
This arrangement of basal branches is identical to that obtained with the 13PCG12 (third codon sites removed) + rRNAs dataset, whereas the 5PCG12 (COI-III, Cytb, and ATP6) + rRNAs dataset shows a different topology [34]. Although the topology changes under different phylogenetic hypotheses, with Tipulomorpha containing either Tipulidae alone or Tipulidae + Trichoceridae [44][45][46], we can conclude that the basal placement of Tipulomorpha in the phylogeny of Diptera is stable. The phylogenetic analyses in this study were based only on mitochondrial data, so we believe it remains essential to combine nuclear and mitochondrial data with broader taxon sampling to provide an even more robust phylogenetic analysis of the evolution of Diptera.

Implications
Low flight capacity, a preference to remain close to their area of emergence, geographic barriers and variability in climate across their distribution have led to genetically structured populations of phlebotomine sand flies, with cryptic species also being recorded [47,48]. Genetically distinct species and populations have shown varying abilities both to transmit Leishmania and to resist insecticides [49][50][51], highlighting the need to quantify their population structure and delineate cryptic species. The sliding window analysis presented in this study provides a useful comparison of the evolutionary rates of each gene, allowing future researchers to design population genetic and large-scale phylogenetic studies using the most appropriate marker for their task. One immediate use for such data will be to explore the relationship between P. chinensis and a closely related, disputed vector species, Phlebotomus sichuanensis or 'large type of P. chinensis' [52][53][54]. It is debated whether these two nominal species are in fact distinct or are different populations of the same species occupying different altitudes [16,53,55].
NGS technology, on the Illumina and 454 platforms, is now routinely used in genomic research. Although these sequencing technologies have been shown to recover insect mitochondrial genomes, the A + T-rich region is still difficult to assemble owing to its various complex repeat regions [56,57]. Ramakodi et al. [56] reported that coverage may not be the crucial factor for reconstructing the control region from 454 reads, and that known repeat sequences can help to reconstruct the full length of the control region. In the present study, we successfully retrieved the complete mitochondrial genome of P. papatasi, including the entire A + T-rich region, using P. chinensis as the reference. The control regions of both species contain a similar pattern of repeat sequences, and the repeat units share 60.2 % similarity, suggesting that the control region (or repeat sequences) of a closely related species may contribute to the reconstruction of a new control region. Furthermore, the results indicate that the mitochondrial genome of a closely related species is a more appropriate reference than short target sequences for reconstructing the full length of the control region, particularly one that includes complex repeat sequences. In other words, the reference species and sequence must be carefully selected when using this approach. These first Phlebotomus mitochondrial genomes will make it easier to generate additional mitochondrial genome data, including control regions, from different populations and species, which will provide insight into the speciation, distribution patterns, evolution and divergence times of sand flies at the genome level [58][59][60][61].

Conclusion
The present study determined the mitochondrial genomes of P. chinensis and P. papatasi and conducted a comparative analysis of three sand fly mitochondrial genomes. We present the first examination of the phylogenetic status of Psychodidae and, based on all mitochondrial PCGs, provide stable support that the families Psychodidae and Tanyderidae are sister taxa. We confirmed that known control region sequences from closely related species facilitate the reconstruction of an uncharacterized control region using a similar approach. Our results also provide a source of genetic markers for future studies on the population biology and molecular phylogeny of these important vectors.