Quantitative Imagery Analysis of Spot Patterns for Haplogroup Classification of Triatoma dimidiata (Latreille, 1811) (Hemiptera: Reduviidae), an Important Vector of Chagas Disease

Daryl David Cruz Flores (daryldavidcf@gmail.com), Centro de Investigación en Biodiversidad y Conservación, https://orcid.org/0000-0002-7714-2459
Dennis Denis Ávila, Universidad de la Habana, Facultad de Biologia
Elizabeth Arellano, Centro de Investigacion en Biodiversidad y Conservacion
Carlos N. Ibarra-Cerdeña, CINVESTAV IPN: Centro de Investigacion y de Estudios Avanzados del Instituto Politecnico Nacional


Background
Genetic and morphological divergences associated with speciation processes may not appear at the same time or progress at the same rate [1]. The emergence of new species usually results from the isolation of populations due to geographic, ecological, or behavioral barriers, which act individually or synergistically [2]. This can lead to populations that have substantial genetic differentiation that has not been expressed phenotypically (at least not obviously), giving rise to cryptic species [3]. Identifying cryptic species complexes is one of the most important challenges facing taxonomy in recent years [4].
The correct delimitation of cryptic species has important implications for research in many fields of biology, such as studies on biodiversity, conservation, and behavioral ecology [4]. Frequently, these cryptic species delimitations are achieved using different types of data such as molecular, ecological, behavioral, and geometric morphometric data [5]. This combination of methods, known as integrative taxonomy [6], is the surest and most precise way of determining species limits [7,8].
Cryptic species in the genus Triatoma (main vectors of Chagas disease) have mainly been recognized using molecular tools [9,10,11,12], although both ecological and morphometric analyses have also been used [13]. Within Triatoma, the dimidiata complex has received considerable attention, in part because it is one of the most widely distributed triatomine species complexes. It is the only triatomine bug that naturally occurs throughout the northern neotropical realm of North, Central, and South America [14].
Within the dimidiata complex, at least five new species have been proposed based on genetic data [15]. Furthermore, the species in this complex have different morphological patterns [16,17]. In the field of epidemiological entomology, the delimitation of species of medical importance is vital for the establishment of efficient control strategies [13,18]. Geometric morphometric techniques using landmarks [19,20,21,22] or body contour descriptors [23,24,25,26] have been used for this purpose, mainly because of their superiority over traditional morphometric methods [27] and because they are cheaper than, for example, molecular methods.
Spot patterns are widely used to describe species in traditional taxonomy [28]. However, because spot pattern is highly variable due to its ecological functions, it is usually described in subjective, qualitative terms. Alternatively, using digital tools to quantify spot patterns can minimize bias, increase precision, and allow automated identification processes [29]. However, since few studies use quantitative measurements of pattern properties for taxonomic purposes, evidence on the usefulness of color patterns to separate (or discriminate) species is still lacking [30].
The general body color of triatomines is black or spruce, with pattern elements ranging from light yellow to light brown, orange, or red shades [31]. The lighter pattern elements can be present on any area of the body or appendages, and the color, intensity, and distribution of these elements are of considerable importance for systematic purposes. The pattern of the connexivum region is particularly marked [32].
However, despite their taxonomic importance, there have been very few quantitative studies of color and pattern variation in Triatoma. The first study to quantify color patterns in a species of Triatoma [33] analyzed the melanic and non-melanic forms of domestic and peridomestic populations of Triatoma infestans. Another study [34] explored, among other aspects, pattern variation as a function of elevation in triatomines from El Salvador, including populations of T. dimidiata. To date, the utility of the spot pattern to discriminate among haplogroups within the dimidiata complex or any other triatomine complex has not been explored.
Although there are no obvious external morphological differences based on our observations of dimidiata complex specimens, we hypothesize that the evaluation of more detailed quantitative differences in the spot pattern between haplogroups could be used to distinguish them morphologically. This method could potentially be useful to improve the separation criteria for cryptic species in this group without requiring genetic data. In this work, we evaluate the reliability of discriminating among three T. dimidiata haplogroups using the dorsal spot pattern.
If successful, this technique could be extended to other species in this and other groups and lay the foundations for an automated identification system to facilitate correct species recognition within the genus Triatoma.

Sample information
Images of individuals from each of the haplogroups of Triatoma dimidiata were obtained from Gurgel-Gonçalves et al. [20]. These images are part of a collection of images of 51 triatomine species from Mexico and Brazil available for public use in the Dryad repository (http://dx.doi.org/10.5061/dryad.br14k). The original series of images that represent the species distributed in Mexico was taken from the following entomological collections in Mexico: Regional Center for Health Research, National Institute of Public Health of Mexico, Guanajuato State Public Health Laboratory, Benito Juárez Autonomous University of Oaxaca, and the Autonomous University of Nuevo León, Monterrey. Details of how these images were taken are described in the referenced publication. We obtained a total of 44, 30, and 40 images of individuals belonging to haplogroups 1, 2, and 3, respectively. The haplogroup assignment of these individuals was corroborated genetically, and this corroboration was a major reason for using these images in a quantitative analysis such as ours. From the 114 images, we selected only high-quality images that clearly captured the spot pattern, eliminating the cases where the spots were fused or covered by hyperchromatic wings (Additional file 1). This resulted in a final sample of 101 images (39 from H1, 23 from H2, and 39 from H3).

Image processing
The images were processed to facilitate the extraction of standardized measurements of the spot pattern (Fig. 1). The abdomens were clipped manually, removing the legs and cutting off the head at the thorax level. Subsequently, the images were aligned and re-scaled, using the insertion angles of the abdomen and thorax and the back of the body as references for alignment and scaling all individuals to the width of the first individual that was taken as a reference (image H10355). These transformations may slightly alter the shape and absolute values of the spot measurements, but they are essential to standardize the spatial patterns of the spots and make them comparable, eliminating differences due to body shape or size, whose identifying value has been tested in previous works [20,26]. For this reason, the quantitative estimates of areas were always expressed relative to the total area of the abdomen and linear measurements are relative to the square root of the total abdomen area.
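The standardization described above can be illustrated with a minimal sketch. The function name and the pixel values are hypothetical and are not taken from the ImageJ macro; the sketch only shows that dividing areas by the abdomen area, and lengths by its square root, makes the measurements independent of image scale.

```python
import math

def standardize(spot_area, spot_length, abdomen_area):
    """Return (relative area in %, relative length) for one spot.

    Areas are divided by the total abdomen area; linear measurements
    are divided by the square root of that area.
    """
    if abdomen_area <= 0:
        raise ValueError("abdomen_area must be positive")
    relative_area = 100.0 * spot_area / abdomen_area
    relative_length = spot_length / math.sqrt(abdomen_area)
    return relative_area, relative_length

# Hypothetical values (pixels): the same spot photographed at two scales.
small = standardize(spot_area=500.0, spot_length=40.0, abdomen_area=10_000.0)
large = standardize(spot_area=2_000.0, spot_length=80.0, abdomen_area=40_000.0)
print(small, large)
```

Doubling the linear scale of the image multiplies areas by four and lengths by two, so both standardized values remain unchanged.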
Processing for spot pattern extraction included removing color information (desaturation) and reduction of levels to the central 50% of the image histogram. In some cases, noise produced by surface reflectance of the specimens, or shadows that artificially connected adjacent spots during the binarization of the images, was manually eliminated.
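One possible reading of these two steps is sketched below. It rests on two assumptions that the text does not confirm: desaturation is taken as the mean of the three color channels, and "the central 50% of the image histogram" is taken as the gray levels 64 to 191 of an 8-bit image, stretched to the full 0-255 range. The function names are illustrative.

```python
def desaturate(rgb):
    """Collapse one (R, G, B) pixel to a gray value (assumed: channel mean)."""
    r, g, b = rgb
    return (r + g + b) / 3.0

def clip_levels(gray, low=64, high=191):
    """Stretch the assumed central range [low, high] to 0-255.

    Values below `low` become black and values above `high` become white,
    which increases the contrast between spots and background.
    """
    if gray <= low:
        return 0
    if gray >= high:
        return 255
    return round(255 * (gray - low) / (high - low))

# Hypothetical pixels: dark cuticle, mid-tone, and a light spot.
pixels = [(20, 25, 30), (120, 130, 110), (230, 200, 150)]
processed = [clip_levels(desaturate(p)) for p in pixels]
print(processed)
```

Under this reading, pixels in the extreme quarters of the gray scale are pushed to pure black or white before binarization.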
A macro (Additional file 2) was programmed in ImageJ [35] to automate image processing and measurements. This included 8-bit image conversion, binarization with a minimum automatic threshold, background removal, mask conversion, and gap filling. The outlier points, both black and white (using radius 6 and threshold 50), were then removed and the resulting particles (spots) were measured. [20] Heat maps were obtained by superimposing the images of the spot patterns of all individuals per haplogroup, using the PAT-GEOM v1.0.0 package, developed by Chan et al. [36]. This package allows the quantitative analysis of different measures of the coloration pattern, and it was designed to work with macros in ImageJ. These maps allowed us to visually explore and qualitatively describe the general patterns that characterized each haplogroup.
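The idea behind the heat maps can be sketched as a pixel-wise average of aligned binary masks. This is not the PAT-GEOM implementation, only an illustration of superimposition under the assumption that every mask has the same dimensions and codes spot pixels as 1 and background as 0; the tiny masks are hypothetical.

```python
def heat_map(masks):
    """Return, for each pixel, the fraction of individuals with a spot there."""
    if not masks:
        raise ValueError("at least one mask is required")
    rows, cols = len(masks[0]), len(masks[0][0])
    for mask in masks:
        if len(mask) != rows or any(len(row) != cols for row in mask):
            raise ValueError("all masks must share the same dimensions")
    n = len(masks)
    return [[sum(mask[r][c] for mask in masks) / n for c in range(cols)]
            for r in range(rows)]

# Three hypothetical 2 x 3 masks standing in for aligned abdomen images.
m1 = [[1, 0, 0], [1, 1, 0]]
m2 = [[1, 0, 0], [0, 1, 0]]
m3 = [[1, 1, 0], [0, 1, 0]]
heat = heat_map([m1, m2, m3])
print(heat)
```

Values close to 1 mark positions where nearly all individuals of a haplogroup have a spot, and values close to 0 mark positions that are almost always unspotted.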

Quantitative characterization of the spot pattern
The spots were numbered consecutively for identification; spots 1 and 2 were the central spots, and the spots on the edge of the abdomen were numbered with consecutive odd numbers on the left and even numbers on the right. To quantitatively describe the pattern of spots, a series of primary variables were taken at the spot level, as well as derived variables that included both the spot and individual levels.
The variables measured are shown in Fig. 2. The total body area (Ta) was used for standardization purposes only. The relative area (Ra) was the area of each spot expressed as a percentage of Ta (%). The sum of the Euclidean distances (SED) was calculated by taking the centroid coordinate of each spot and calculating, at the individual level, the distance between the central and lateral spots after a Procrustes superimposition of the complete configurations. The maximum and minimum Feret diameters (MaxFd and MinFd, respectively), as well as the Feret angle (Fa), were calculated for each spot. These variables refer to the maximum and minimum distances between any pair of contour points of a shape, and although they are identified as diameters, they are not strictly analogous to a diameter, since they do not pass through the center of the figure or divide it into symmetrical sections. The Fa refers to the angle of the vector of the MaxFd and indicates the general directionality of the spot (its inclination). The aspect ratio (Ar) of each spot (ratio of the minor to the major diameter) was used as an indicator of its shape.
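The Feret-based variables can be illustrated with a short sketch that works on a list of contour points. It is not the ImageJ routine: the minimum Feret diameter is approximated here by scanning projection directions in 1° steps, the angle assumes Cartesian coordinates (y increasing upward), and the aspect ratio is taken as MinFd/MaxFd, which is an assumption about how the two diameters were combined. All names are illustrative.

```python
import math
from itertools import combinations

def feret_descriptors(points, angle_step_deg=1.0):
    """Approximate MaxFd, MinFd, Feret angle, and aspect ratio of a contour."""
    # Maximum Feret diameter: largest distance between any two contour points.
    max_fd, (p, q) = max(
        ((math.dist(a, b), (a, b)) for a, b in combinations(points, 2)),
        key=lambda item: item[0],
    )
    # Feret angle: direction of the MaxFd vector, folded into [0, 180).
    angle = math.degrees(math.atan2(q[1] - p[1], q[0] - p[0])) % 180.0
    # Minimum Feret diameter: narrowest extent of the shape over all
    # sampled projection directions (a brute-force approximation).
    min_fd = math.inf
    for k in range(int(round(180.0 / angle_step_deg))):
        theta = math.radians(k * angle_step_deg)
        proj = [x * math.cos(theta) + y * math.sin(theta) for x, y in points]
        min_fd = min(min_fd, max(proj) - min(proj))
    return {"MaxFd": max_fd, "MinFd": min_fd, "Fa": angle,
            "Ar": min_fd / max_fd}

# Hypothetical spot: a 4 x 2 rectangle given by its corner points.
spot = [(0.0, 0.0), (4.0, 0.0), (4.0, 2.0), (0.0, 2.0)]
result = feret_descriptors(spot)
print(result)
```

For the rectangle, MaxFd is the diagonal and MinFd is the short side, which shows why these "diameters" need not pass through the center of the figure or split it symmetrically.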
For each individual, the averages of the variables per spot, the sum of the total Ra of the spots, and the ratio of the mean Ra of the central spots to the lateral spots were calculated as derived variables. For the calculation of the average inclination angle, both for the central and lateral spots, the angles of the spots from the left to the right quadrant (0-90°) were reflected.
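The reflection step matters because mirror-image spots on the left and right sides would otherwise average to a meaningless value. The sketch below assumes that angles are given in degrees in the range 0-180 and that reflection means mapping an angle a greater than 90° to 180° - a; the function names are illustrative.

```python
def reflect_to_first_quadrant(angle_deg):
    """Mirror an angle in (90, 180) onto the 0-90 degree quadrant."""
    a = angle_deg % 180.0
    return 180.0 - a if a > 90.0 else a

def mean_inclination(angles_deg):
    """Average inclination after reflecting every angle to 0-90 degrees."""
    reflected = [reflect_to_first_quadrant(a) for a in angles_deg]
    return sum(reflected) / len(reflected)

# Hypothetical pair of mirror-image lateral spots (right: 60, left: 120).
pair = [60.0, 120.0]
naive_mean = sum(pair) / len(pair)
reflected_mean = mean_inclination(pair)
print(naive_mean, reflected_mean)
```

Without reflection the pair averages to 90°, which describes neither spot; after reflection both spots share the same 60° inclination.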

Data analysis
Non-parametric descriptive statistics (median, quartiles, and range) were used because the distribution of the data was not normal, and traditional descriptors gave a false impression of precision and marked differences. Statistical comparisons among haplogroups were done using Kruskal-Wallis tests in Statistica v8 software. Also, a forward stepwise Linear Discriminant Function Analysis (LDFA) was performed to estimate the ability to discriminate haplogroups based on the variables used. Since this method has a series of restrictive assumptions and can only differentiate the groups linearly, a multilayer perceptron neural classification network was used as an alternative method. Neural networks are supervised machine learning procedures that do not rest on statistical assumptions about the nature of the data, making them more powerful and capable of exploring nonlinear relationships in complex sets of variables. The most efficient topology for the network was found by the automated search procedure of the Statistica 8.0 software, considering all of the variables analyzed. The network was trained with 60% of the individuals by haplogroup and validated with the remaining 40%. Assignment to each group was random, except for individuals wrongly classified by the LDFA, which were forced into the validation sample for a more robust check of network performance. The weight assigned by the neural network to each variable was estimated to identify those of greatest importance in the discrimination process.
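The training and validation assignment can be sketched as a stratified random split with forced validation members. This is not the Statistica procedure; the function name, the seed, and the specimen labels are hypothetical, and the sketch assumes that the 60% target is applied within each haplogroup.

```python
import random

def split_by_haplogroup(ids_by_group, forced_validation,
                        train_fraction=0.6, seed=0):
    """Randomly split each group, forcing selected IDs into validation."""
    rng = random.Random(seed)
    train, validation = [], []
    for group, ids in ids_by_group.items():
        forced = [i for i in ids if i in forced_validation]
        free = [i for i in ids if i not in forced_validation]
        rng.shuffle(free)
        # Target about 60% of the group for training, drawn only from
        # individuals that are not forced into the validation sample.
        n_train = min(round(train_fraction * len(ids)), len(free))
        train.extend(free[:n_train])
        validation.extend(forced + free[n_train:])
    return train, validation

# Hypothetical specimen labels; "A3" and "B0" stand for LDFA errors.
groups = {"A": [f"A{i}" for i in range(10)],
          "B": [f"B{i}" for i in range(5)]}
forced = {"A3", "B0"}
train_ids, validation_ids = split_by_haplogroup(groups, forced)
print(len(train_ids), len(validation_ids))
```

Placing the individuals that the LDFA misclassified in the validation sample means the network is never trained on them, so its performance on the hardest cases is tested directly.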

Results
The heat maps generated by superimposing all the individuals within each haplogroup show the spot patterns that characterize each haplogroup and evidence a well-differentiated pattern between them (Fig. 3). The most differentiated pattern was presented by haplogroup 2, mainly apparent in the notably larger central spots. Haplogroups 1 and 3 were more similar to each other, but there were consistent differences in the shape and orientation of the spots.
The ratio of spotted area to Ta differed among haplogroups. The highest Ra was presented by haplogroup 2 with 15.6%, while haplogroup 3 had only 8.7% (Fig. 4A). When comparing the ratio of the area of the central spots to the lateral spots, haplogroups 1 and 3 had higher relative lateral spot areas. In haplogroup 2, the lateral and central spots contributed almost equally to the total spot area, while the percentage of the central spots area was slightly higher. When statistically comparing the mean Ra of the central and lateral spots, only the central spot area differed significantly among haplogroups (Fig. 4B and C).
The average spot size, characterized by Feret diameters, was significantly different among haplogroups, both for the central and lateral spots (Fig. 5). In the case of the central spots, haplogroup 2 was the most strongly differentiated (Fig. 5A), while for the lateral spots, haplogroup 1 presented the most notable differences (Fig. 5B).
The orientation of the abdominal spots, expressed by the Fa, was markedly different between haplogroup 1 and the other groups. The largest differences were observed in the orientations of the first three pairs of spots (3/4, 5/6, and 7/8), which tended to be more forward oriented. For the remaining spots, although differences in orientation were observed, these were less noticeable, both in the Fa value and in its variation among individuals (Fig. 6).
When comparing the mean orientations of the lateral spots of the abdomen (Fa), significant differences were found among the three haplogroups. Haplogroup 1 was the most distinct and had less variation in Fa than the remaining haplogroups (Fig. 7).
The shapes of both the central and lateral spots (Ar) differed among haplogroups (Fig. 8). The shape of the central spots in haplogroup 2 showed the greatest differences among the haplogroups, while the most differentiated lateral spots were from haplogroup 3.
The most efficient neural classification network had a topology with 20 neurons in the hidden layer. This achieved an overall performance of 94.7% with a BFGS-12 training algorithm and an entropy error function. All of the training data (100%) were correctly identified, and considering only the validation data, 87.2% of the individuals were correctly identified. The hidden layer had sine activation functions and the output layer logistic functions, with Sum of Squares as an error function. This network achieved 100% correct classification of H2 specimens and misclassified three H1 individuals (H1 0367, H1 0372 and H1 0374; from a total of 16 in the validation sample) as H3, and two H3 individuals (H3 0847 and H3 0862) as H2. The remaining three individuals that had been incorrectly classified by LDFA were correctly assigned to their haplogroups by the neural network (H2 0504; H3 0388 and H3 0395).
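The validation figures reported above can be cross-checked with simple arithmetic. The sketch below uses only the counts given in the text (three H1 errors out of 16 H1 validation individuals, two H3 errors, no H2 errors, and 87.2% correct overall). The total validation sample size is not stated, so the value derived here is an inference from the reported percentage, not a reported number.

```python
# Misclassified validation individuals per haplogroup, as reported.
errors = {"H1": 3, "H2": 0, "H3": 2}
h1_validation_n = 16          # stated for H1 only
reported_accuracy = 87.2      # % correct in the validation sample

# Correct classification rate within the H1 validation individuals.
h1_accuracy = 100.0 * (h1_validation_n - errors["H1"]) / h1_validation_n

# Validation sample size implied by the error count and reported accuracy.
total_errors = sum(errors.values())
implied_n = round(total_errors / (1.0 - reported_accuracy / 100.0))

# Recompute the accuracy from the implied sample size as a consistency check.
recomputed = round(100.0 * (implied_n - total_errors) / implied_n, 1)
print(h1_accuracy, implied_n, recomputed)
```

Five errors at 87.2% accuracy are consistent with a validation sample of about 39 individuals, close to 40% of the 101 images analyzed.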
Classification methods made similar use of variables. The LDFA used only five variables in the final model: the size of the central spots, the shape, angle, and diameter of the lateral spots, and the total relative spot area. The neural network assigned greater importance to these same variables, and additionally included the relative area of the lateral spots.
When analyzing the weights assigned to each variable used in the neural network procedure (Fig. 10), the most important variables in the classification process were the Feret diameter and the aspect ratio of the lateral spots, in that order. The variable that contributed the least to the classification was the ratio of the central spot area to lateral spot area.

Discussion
The cryptic dimidiata species complex has been largely supported using molecular tools, which has led to the identification of three phylogenetically well-differentiated haplogroups in Mexico and part of Central America [9,11,12], and two taxa have been formally described as new species [37,38]. For the first time, we used the spot pattern presented by this complex to discriminate among haplogroups (possible cryptic species) by extracting and analyzing quantifiable variables from digital images.
Our results demonstrate the ability to use these measures to correctly recognize the haplogroups analyzed. Of the variables used for discrimination, only one (mean relative area of the lateral spots) did not differ significantly among haplogroups, indicating that overall, pattern variables were useful for delimitation. This was verified both by the discriminant analysis ordination plot and the results obtained by the most efficient neural network.
The study of coloration in triatomines and its application in taxonomy has mainly been used in traditional qualitative approaches [32]. This has led to the assumption of a lack of clear morphological diagnostic characters to facilitate recognition and formal descriptions at the species level [15]. However, using heat maps, three well-differentiated spot patterns were evident, corresponding with the three haplogroups. This, once again, highlights the importance of using quantitative tools to study complex patterns such as coloration, where subtle aspects such as the orientation of groups of spots or other patterns may not be apparent or easily distinguishable to a human observer.
The variation found among the haplogroups in spot pattern may respond to different processes. In other groups of insects such as butterflies, coloration patterns have been shown to vary depending on environmental conditions such as temperature [39,40], associated with processes of genetic assimilation of phenotypic changes [41]. Although there are populations where the three T. dimidiata haplogroups analyzed in this study are found sympatrically [see 11], their distributions are mostly allopatric; therefore, the pattern of variation among these haplogroups may reflect adaptation to environments with different characteristics in response to environmental stress. Recently, genetic assimilation in the evolution of phenotypic plasticity has been demonstrated not only for butterflies but also for various groups of organisms [42,43,44,45]. However, corroborating this phenomenon in T. dimidiata will require specifically designed studies.
Another important aspect that this research demonstrates is the value of the combination of digital image analysis and machine learning for taxonomy [46]. The potential of this combination of approaches in species delimitation has been broadly demonstrated [47]. However, even though its utility is clear and, in many cases, superior to the traditional taxonomy, it is still relatively rarely used.
Classical taxonomy is a science that is practically in danger of extinction, especially due to the lack of expert taxonomists and specialists in species identification, who require many years of training and experience [48]. In the era of big data, image pattern recognition is a new technology that provides many potential advantages for taxonomists, including speeding up and automating the classification process, reducing error, and assimilating quantitative information that would be impossible for a human observer to process [49].
Specifically, with triatomines, there have been recent efforts to employ these methods to establish automated identification systems. These include the studies of Gurgel-Gonçalves et al. [20] and Khalighifar et al. [50], who, using geometric morphometric techniques and deep learning algorithms, respectively, have taken the first steps in this regard. In another example, Cruz et al. [26] were able to discriminate T. dimidiata haplogroups with high correct discrimination values by characterizing the entire body contour using elliptic Fourier descriptors, and this same method has been successfully used to generate automated identification systems in other groups of insects [51]. The integration of this method with the analysis of the spot pattern is potentially a novel and powerful tool to generate a computer-based approach for species identification in cryptic groups. Although classification processes still need improvement, these novel works bring new challenges and perspectives in the field of epidemiological entomology, and the integration of methods should be a central aspect in the future of automated identification in this group, given its epidemiological importance, as well as in other groups of insects.
Although this research was focused on evaluating the possibility of correctly discriminating three haplogroups of T. dimidiata using the dorsal spot pattern, the value of coloration patterns in species biology cannot be forgotten. Color in insects has important biological functions including mate choice, intra-sexual competition, dominance relationships, and other social interactions [52]. Therefore, its study is relevant in many contexts beyond taxonomy, and research should be increased to explore the role that coloration patterns play in nature. Compared with other groups of insects such as Coleoptera, Lepidoptera, and Hymenoptera, there are very few studies of coloration patterns in Hemiptera, and especially in the subfamily Triatominae [52].

Conclusions
The correct recognition of insect species of epidemiological importance is vital for the establishment of good control measures [13,18]. The results obtained in this investigation allow us to conclude that the spot pattern in triatomines constitutes a significant source of information that can be used directly in the taxonomic analysis of this group of insects. If we consider that the haplogroups used here may constitute phylogenetically close cryptic species [9,11,53], similar pattern analysis in a larger number of less closely related species will likely find larger, more easily distinguishable differences in the spotting pattern than those found here. A fruitful avenue for future research would be comparing spot patterns among multiple species in order to discriminate among them. If similarly successful, pattern recognition could allow, in the not too distant future, the development of a reliable automated identification system to use as a tool for the recognition of vectors of Chagas disease, one of the most important diseases on the American continent.
