Identification and phylogenetic analysis of voltage-gated sodium channel haplotypes in the malaria vector Anopheles sinensis using a high-throughput amplicon sequencing approach

Background Anopheles sinensis is a dominant vector for malaria transmission in Asian countries. Voltage-gated sodium channel (VGSC) mutation-mediated knock-down resistance (kdr) has developed in many A. sinensis populations because of intensive and long-term use of pyrethroids. Our previous study showed that multiple mutations at position 1014 of the VGSC were heterogeneously distributed in A. sinensis populations across Sichuan, China. Methods To understand resistance genotypes at the haplotype level and reconstruct the phylogenetic relationship of VGSC haplotypes, a cost-effective next-generation sequencing (NGS)-based amplicon sequencing approach was established to clarify haplotypes containing codon 1014 of the VGSC gene from a total of 446 adults collected in 12 locations of Sichuan, China. Results Nineteen (19) haplotypes were identified, including 11 wild 1014L, 6 resistance 1014F, and 2 resistance 1014C haplotypes. We found that resistance haplotypes of A. sinensis VGSC were widely distributed at frequencies ranging from 3.67 to 92.61%. The frequencies of the 1014C haplotype in the southeast of Sichuan (Luzhou, Guangan, and Suining) were relatively higher than those in other sampling locations. Phylogenetic analyses support that kdr-type mutation at position 1014 is not singly originated and resistance 1014C haplotypes evolve from TTT-encoding 1014F. Conclusions A cost-effective next-generation sequencing (NGS)-based amplicon sequencing approach has been established in this study. The data revealed the patchy distribution of VGSC resistance haplotypes with overall high frequencies in Sichuan, China. Phylogenetic analyses support multiple origins and sequential evolution (1014L → 1014F → 1014C) for kdr-type mutations in A. sinensis. Graphical abstract

Sichuan province is located in the southwest of China (Fig. 1). The natural environment in most parts of Sichuan is suitable for the breeding of malaria vectors. Historically, Sichuan has been a malaria-endemic region in China. Although no indigenous malaria case has been reported since 2011, the increasing imported cases indicate a potential risk for the re-emergence of malaria in Sichuan [10,11]. The intensive use of insecticides such as deltamethrin and permethrin for mosquito-targeted control and in agriculture has led to the development of insecticide resistance in the primary malaria vector Anopheles sinensis in this region [12,13].
A recent study detected multiple mutations at position 1014 of the VGSC in Sichuan A. sinensis by Sanger sequencing [14]. However, due to the diverse polymorphisms in the VGSC gene, direct DNA sequencing of PCR product from heterozygous individuals is incapable of haplotype identification. The identification of haplotype is useful for understanding the origin and evolution of kdr-associated mutations and helpful for monitoring distribution and spread of kdr haplotypes. In this context, we adopted the NGS platform to establish an amplicon-based sequencing pipeline to clarify VGSC haplotypes. The specific objectives of this effort were (i) to identify VGSC haplotypes and delineate their geographical distribution in A. sinensis populations across Sichuan and (ii) to elucidate the evolutionary relationship of kdr-type resistance mutations.

Anopheles sinensis collections
A total of 446 adult A. sinensis were caught from 12 sampling locations across Sichuan from August to September in 2018. The sampling locations are shown in Fig. 1.

Anopheles sinensis genomic DNA (gDNA) preparation
The gDNA was isolated from each mosquito using the protocol of Rinkevinch et al. [15]. The gDNA concentration for each mosquito was detected by Nanodrop 2000 (Thermo, USA) and then adjusted to 50 ng/μl with ddH 2 O. Then, the gDNA of individual mosquitoes from the same sampling location was pooled in

VGSC amplicon preparation and sequencing
Primers with population-specific tags for amplicon preparation were designed based on the primers used to amplify a 325-bp genomic region that contains codon 1014 and a complete intron of the VGSC (kdr-F: 5′-TGC CAC TCC GTG TGT TTA GA; kdr-R: 5′-GAG   CGA TGA TGA TCC GAA AT) [16]. The populationspecific tagged sequence (4-6 bp) was added at the 5′ end of the primer pairs. Details of the primers with tag sequences are shown in

Bioinformatics pipeline
Two FASTQ format files with a total of 1.496 G raw data were obtained after sequencing. Low-quality sequences were filtered using Trimmomatic [17], and a total of 1.487 G clean data was obtained. Clean data (FASTQ format) were then submitted to pandaseq 2.11 [18] for pair-end assembling. Subsequently, based on population-specific tagged-primer sequences, assembled sequences belonging to the respective populations were separated by the Linux opera system's grep command (Ubuntu 20.04 LTS). Sequence reverse complement was performed by Seqtk (https:// github. com/ lh3/ seqtk). For chimera sequence filtering, sequences with both 5′-end and 3′-end matched population-specific tagged-primer were passed. The deletion of pair-end tag sequences was performed by the Linux opera system's sed command. After that, identical sequences from each population were merged and counted using FASTX-Toolkit (http:// hanno nlab. cshl. edu/ fastx_ toolk it/). The haplotype with a count of less than the threshold was removed (threshold = S/2 N; S is the sum of the number of clean sequences in each population, and N is the sample size from each sampling location). The brief pipeline is presented in Fig. 2 (VIII-XII).

Data analysis
Referring to the method of Délye et al. [19], Pearson's correlation analysis was used for assessing the agreement in allele frequencies obtained by Sanger sequencing [14] and the NGS approaches. The Pearson's correlation coefficient (r) was calculated by function cor () in the R programming language. The haplotypes identified in this study and those retrieved from GenBank [20][21][22][23] were used for constructing a phylogenetic tree by the maximum likelihood (ML) method and Kimura two-parameter model [24] (1000 bootstrap replicates) in MEGA-X software [25]. Homologous sequences of Anopheles culicifacies were used as outgroup. The genealogical network analysis of haplotypes was performed by Network 5.0 (https:// www. fluxus-engin eering. com/ share net. htm) using median joining for network calculation [26].

Identification of VGSC haplotypes
Fourteen nucleotide polymorphic sites (eight in intron and six in exon) were observed in the amplified fragments (325 bp) of the A. sinensis VGSC gene (Fig. 3). The mutations at nucleotides 203 and 204 led to nonconservative amino-acid residue substitutions (L to F or L to C) at position 1014. In addition, a previously reported mutation at locus 178 (A/T) [20,22], which causes a conservative substitution (L1006I), was also detected. Based on all the DNA sequences, a total of 19 haplotypes were identified, including 11 wild-type 1014L, 6 kdr-type 1014F, and 2 kdr-type 1014C. Five haplotypes (H8-L, H14-L, H19-L, H11-F, and H18-C) were newly identified in this study ( Table 2).

Frequency and distribution of each VGSC haplotype
The number of haplotypes in each sampling location ranged from 6 to 13 (Table 3). H2-L and H3-L were the two common wild VGSC haplotypes distributed in all locations except NJLC, while H12-L and H14-L were rare and only distributed in PZH and GAQF, respectively. The overall frequency of resistance haplotypes was high in Sichuan (Table 3 and Fig. 4). Notably, H1-F was the predominant resistance haplotype across Sichuan except PZH (Fig. 4). H7-C had higher frequencies in LZJY, GAQF, and SNDY than in other sampling locations and was detected in 10 of the 12 locations, while H18-C was only distributed in LZJY at a low frequency (Table 3, Fig. 4).

Discussion
In this study, the NGS platform was deployed to understand the type and frequency of kdr mutation at the haplotype level. Correlation analysis showed that the frequency of kdr alleles detected by the NGS-based method herein was in good agreement with that determined by the Sanger sequencing [14]. Pearson's correlation coefficients, i.e. 0.912 (p-value = 3.54 × 10 -5 ), 0.906 (p-value = 4.84 × 10 -5 ), and 0.954 (p-value = 1.49 × 10 -6 ) for the frequency of 1014L, 1014F and 1014C, respectively (Fig. 7), suggest that the NGS-based method is reliable. By comparison, NGS-based amplicon sequencing is more costeffective than Sanger sequencing [27] and can clarify other nucleotide variations in the sequence of interest for haplotype identification.
Using this NGS approach, 19 VGSC haplotypes of A. sinensis were identified in Sichuan. The eight resistance haplotypes display a patchy distribution, and each sampling location has at least one resistance haplotype (Fig. 4). Notably, H1-F is distributed in all sampling locations except PZH and has an extremely high frequency (77.2%) in NJLC, and it also is the predominant resistance haplotype in other locations (Table 3 and Fig. 4). This observation led us to postulate that H1-F may have better protection and/or lower fitness cost than other kdr-type haplotypes in Sichuan. Special attention thus should be paid to this haplotype in future surveillance.
Interestingly, beyond Sichuan, both H1-F and H7-C have been previously documented in Zhejiang, Henan, Anhui, Guangxi, Guizhou, Hubei, and Jiangsu provinces of China [20,23,[28][29][30] (see Fig. 1 for location). The wide distribution of H1-F and H7-C in locations that are hundreds of miles apart may suggest possible gene flows between these A. sinensis populations,  (Table 2). Haplotypes are connected to one another based on their similarity. Different haplotypes are highlighted by colors (1014L is green, 1014F is red, and 1014C is yellow). Short hash marks perpendicular to branches indicate the number of base pair mutations between haplotypes. The character next to the short hash mark shows the mutation pattern. EX and IN are abbreviations for exons and introns, respectively though we could not exclude the possibility of independent origins of the same haplotype in different regions.
The topology of the ML tree (Fig. 5) and MJ network (Fig. 6) strongly suggests that the resistance mutations are not singly originated. The haplotypes carrying a kdr mutation can be convincingly clustered into two major clades or groups. Specifically, the complex reticulate network in Group 1 supports that divergent evolution may have occurred first, followed by diversification of 1014F haplotypes driven by selection pressure (H3-L → H11-F/H17-F; H6-L → H4-F/ H13-F), and eventually converged toward H4-F. By comparison, the evolutionary relationships for haplotypes in Group 2 are relatively straightforward and support multiple 1014F haplotype origins (H12-L → KF927156-F; H2-L → H1-F). Multiple origins of kdr mutations have also been documented in other insect species such as the house fly Musca domestica [31].
Multiple kdr mutations at position 1014 of the VGSC have been reported in several insects [32]. In A. sinensis, three types of kdr mutations (L1014F/C/S) have been documented [14,23,30]. The same mutations have been detected in a series of other anophelines [33,34]. After reviewing the available literature, we found that the classical L1014F appears to be the first selected kdr allele, probably a consequence of DDT application. With global use of pyrethroids, new alternative mutations may emerge [9]. Although the level of pyrethroid resistance conferred by L1014F and L1014H replacement in the house fly is clear [35], to our knowledge, there are no published data about the effect caused by different mutations at position 1014 on insecticide resistance in A. sinensis. Knowledge about insecticide tolerance and fitness associated with different resistance allele will be of importance to understand the evolution of kdr mutations and helpful for developing pest control strategies.
Our network analysis indicated that only one mutational step is able to change TTT-encoding 1014F to 1014C. This observation confirms a sequential evolution from 1014F to 1014C in A. sinensis as proposed in our previous paper [23]. The codon usage may explain this notion: one mutational step can lead to the L1014F (TTG to TTT) or F1014C (TTT to TGT) substitution, while it requires two base mutations to realize the L1014C (TTG to TGT) replacement. However, this is not the case for the two resistance mutations (L1014F and L1014H) in house flies: the 1014F (TTT) allele could not evolve into the 1014H (CAT) allele (or vice versa) [9], because it would require two nucleotide substitutions.

Conclusions
This study adopted the NGS platform for cost-effectively clarifying VGSC haplotypes of A. sinensis and provided information about the types and distribution of the kdr-type mutations at the haplotype level in Sichuan, China. Our results support multiple origins of kdr alleles and confirm a scenario of sequential evolution from TTT-encoding 1014F to 1014C in A. sinensis. These findings will advance our understanding about the evolutionary history of VGSC resistance haplotypes in the malaria vector A. sinensis.