Improved orthologous databases to ease protozoan targets inference

Background: Homology inference helps to identify similarities and differences among organisms, which provides better insight into how closely related one might be to another. Comparative genomics pipelines are widely adopted tools built from different bioinformatics applications and algorithms. In this article, we propose a methodology to build improved orthologous databases with the potential to aid protozoan target identification, one of the many tasks that benefit from comparative genomics tools.
Methods: Our analyses are based on OrthoSearch, a comparative genomics pipeline originally designed to infer orthologs through protein-profile comparison, supported by an HMM-based, reciprocal best hits approach. Our methodology allows OrthoSearch to confront two orthologous databases and generate an improved new one. The new database can later be used to infer potential protozoan targets through a similarity analysis against the human genome.
Results: The protein sequences of the Cryptosporidium hominis, Entamoeba histolytica and Leishmania infantum genomes were comparatively analyzed against three orthologous databases: (i) EggNOG KOG, (ii) ProtozoaDB and (iii) Kegg Orthology (KO). That allowed us to create two new orthologous databases, “KO + EggNOG KOG” and “KO + EggNOG KOG + ProtozoaDB”, with 16,938 and 27,701 orthologous groups, respectively. These new orthologous databases were used for a regular OrthoSearch run. By confronting the “KO + EggNOG KOG” and “KO + EggNOG KOG + ProtozoaDB” databases with the protozoan species, we detected the following totals of orthologous groups and coverage (relation between the inferred orthologous groups and the species total number of proteins): Cryptosporidium hominis: 1,821 (11 %) and 3,254 (12 %); Entamoeba histolytica: 2,245 (13 %) and 5,305 (19 %); Leishmania infantum: 2,702 (16 %) and 4,760 (17 %). Using our HMM-based methodology and the largest created orthologous database, it was possible to infer 13 orthologous groups which represent potential protozoan targets; these were found because of our distant homology approach. We also provide the number of species-specific, pair-to-pair and core groups from such analyses, depicted in Venn diagrams.
Conclusions: The orthologous databases generated by our HMM-based methodology provide a broader dataset, with larger numbers of orthologous groups than the original databases used as input. They may be used for several homology inference analyses, annotation tasks and protozoan target identification.
Electronic supplementary material: The online version of this article (doi:10.1186/s13071-015-1090-0) contains supplementary material, which is available to authorized users.


Background
Historically, the very definition of Protozoa has been an open debate. Despite the many classifications and revisions proposed over time [1][2][3][4], in this article we refer to Protozoa as eukaryotic organisms, apart from those that primitively lack mitochondria and peroxisomes (Archezoa) and those that share the characteristics which define the Animalia, Fungi, Plantae and Chromista kingdoms [4].
Neglected Tropical Diseases (NTDs) are diseases caused by a variety of organisms and are usually associated with developing countries, which suffer from poor sanitation, hygiene, social and financial conditions. Over 1 billion people are affected by such diseases, in 149 countries worldwide [11]. Among the 17 NTDs listed by WHO, three are caused by protozoan organisms: Chagas' disease (Trypanosoma cruzi), Human African Trypanosomosis (Trypanosoma brucei) and Leishmaniosis (Leishmania spp.) [11].
According to the 3rd WHO report on NTDs, even though several advances have been achieved in recent years, there is a permanent need for research and innovation in improved diagnosis, next-generation treatments and interventions for such NTDs [12,13].
Leishmaniosis is a neglected disease caused by Leishmania spp. and transmitted by phlebotomine sandflies [14]. More than 1.3 million people are infected worldwide, especially those who live in poor sanitation, hygiene and social conditions, and WHO estimates that about 20,000 to 30,000 deaths occur yearly [14]. The disease has three distinct presentations: cutaneous, visceral and mucocutaneous, each of them related to different Leishmania spp. and world regions.
Leishmaniosis is hard to diagnose and treat. So far, the available drugs and vaccines are either toxic or poorly effective [15]. Also, its elevated treatment cost (up to US$252/patient, depending on the applied treatment and drugs) eventually becomes prohibitive to the most affected and poorest countries [15].
Besides allowing for a better comprehension of such protozoan organisms, many molecular studies carried out over the last few years with DNA/RNA sequencing methodologies [16][17][18] have been used to infer new drug targets. These data are available in several public databases and allow for comparative genomics studies among either closely or distantly related organisms. That might also increase the odds of discovering relevant information for drug manufacturing or reuse, which could later be applied to disease treatments.
Comparative genomics mainly refers to homology and evolutionary dynamics between organisms, genes and proteins, which provides a better understanding of how species evolved by comparing either their complete genomes or specific genes [19]. Homologous genes share a common ancestry, either intra- or inter-species. Several scenarios relate to homology, such as orthology, paralogy, horizontal gene transfer, gene loss, orphan genes and others; for this study, we will focus on orthology aspects only [20].
Orthology may be inferred when the same genes or proteins are present in distinct species as the result of a speciation event [20].
Homology inference has become an important step when assigning function to recently sequenced genes, because orthologs tend to preserve their ancestral function.
Besides that, such studies provide better insight into the evolutionary history of genes and, consequently, of species [20,21].
Inferring putative function is one of the particular benefits of orthologous group (OG) assignment, especially when dealing with recently sequenced genome data [22]. In addition, OGs may provide a better understanding of evolutionary relationships among species [23], since such data can support both evolutionary and functional analyses [24].
Moreover, several tasks could benefit from OGs, such as genome annotation, gene conservation, protein family identification, phylogenetic tree reconstruction, pharmacology and many others [22,24-27]. Topics such as positional orthology and synteny conservation among orthologs are also appealing to those who aggregate genomic context in their homology inference methods [28].
There are several available methodologies to aid homology detection. Beyond a simple categorization effort [26], we will follow Dalquen's proposition [29]. Briefly, three distinct approaches are available: (i) those which use multiple sequence alignment (MSA) scores along with reciprocal best hits, such as OrthoSearch [30], OrthoMCL [24] and InParanoid [31]; (ii) those which rely on evolutionary distance calculation, such as RSD [32,33]; and (iii) those based on phylogenetic tree reconstruction, such as SPIMAP [34].
OrthoSearch [40] is a scientific workflow [41] for homology inference among species. Initially conceived as a Perl-based routine, it uses a reciprocal best hits, HMM-based approach. OrthoSearch has already proven to be effective in inferring orthology among five protozoan genomes, using COG and KOG ODs [27].
In this work, we propose an update and a new functionality for OrthoSearch, showing it as an effective tool in providing means to create new ODs (n-ODs). So far, we tested our methodology in a controlled, three steps scenario: (i) Protozoa orthology inference and (ii) n-ODs creation, both supported by publicly available ODs used as input; and (iii) improved Protozoa orthology inference, supported by such recently created n-ODs.
With our methodology and the generated n-ODs, we expect to provide ODs with broader data sets, which in turn can be applied in target identification for protozoan organisms, as stated in the review by Timmers et al. [42] on research efforts related to genomic database development for protozoan parasites.
Moreover, previous initiatives, such as the study performed by Tschoeke et al. [18] regarding the Leishmania amazonensis parasite, as well as the Leishmania donovani comparative genomics analysis performed by Satheesh et al. [43] corroborate the benefits provided by the use of broader orthologous data sets.

OrthoSearch improvements and analyses scenarios
In order to reach our main methodological goal, which is to provide OrthoSearch with means to create n-ODs, we revisited its original pipeline. Notably (i) we adopted HMMER version 3 and (ii) changed from a Perl-based routine to C++ 4.43 and Ruby 1.8.7 modules. A dedicated Ubuntu 12.04 single-server machine with 64 cores and 32GB RAM was used for all assembled scenarios.

OrthoSearch for protozoa orthology inference
OrthoSearch needs as input (i) an OD and (ii) an organism's multifasta protein data. We used Kegg Orthology (KO) [44,45], EggNOG KOG and ProtozoaDB as input ODs. KO, downloaded via FTP, contains data from all life domains: Archaea, Bacteria and Eukarya. EggNOG KOG is a eukaryote-only subset of EggNOG groups [39], downloaded directly from its website. ProtozoaDB OGs [46] (which contain only protozoan species) were also used for our analyses. Details about each OD are available in Additional file 1.
We randomly selected three protozoan species as OrthoSearch organism input data: Cryptosporidium hominis, Entamoeba histolytica and Leishmania infantum, with 3,885, 7,973 and 7,872 proteins, respectively, and downloaded such data from ProtozoaDB [46].

OrthoSearch for orthologous database building
Figure 1 depicts the OrthoSearch pipeline with its two possibilities: (i) a standard orthology inference or (ii) an n-OD creation. In order to build the n-ODs, we used as input data an OD in its original composition, and a data subset of another OD, which acts as an organism's multifasta protein data.
That was called an impersonated proteome and was generated by choosing a representative protein for each OG at the confronted OD. The selection and extraction of such proteins was performed with the support of a Python script kindly developed by Salvador Capella (personal communication, URL: https://github.com/scapella/trimal/blob/dev/scripts/get_sequence_representative_from_alignment.py) and an internally developed Ruby script. Each representative protein identifier and its amino acid sequence were stored in a single multifasta file (see Additional file 2).
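The construction of an impersonated proteome can be sketched as follows. This is only a minimal illustration and not the script cited above: the selection rule used here (the longest member sequence, with ties broken by identifier) is an assumption made for the example, and the names `pick_representative` and `impersonated_proteome` are hypothetical.

```python
def pick_representative(members):
    """Return the (protein_id, sequence) pair chosen to stand for one group.

    Assumption for this sketch: the longest sequence is the representative;
    ties are broken by the smallest identifier so the result is deterministic.
    """
    return min(members.items(), key=lambda item: (-len(item[1]), item[0]))


def impersonated_proteome(database):
    """Build multifasta text with one representative protein per group.

    `database` maps group_id -> {protein_id: amino_acid_sequence}.
    Each header keeps the source group identifier for provenance.
    """
    records = []
    for group_id in sorted(database):
        protein_id, sequence = pick_representative(database[group_id])
        records.append(f">{protein_id} group={group_id}\n{sequence}\n")
    return "".join(records)


# Toy data with made-up identifiers and sequences.
toy_database = {
    "KOG0001": {"p1": "MKT", "p2": "MKTAYIA"},
    "KOG0002": {"p3": "MSLLE"},
}
fasta_text = impersonated_proteome(toy_database)
```

Keeping the group identifier in each header makes it possible to map a hit on a representative protein back to the whole group it stands for.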
We started with KO as our fixed, complete OD, against an impersonated EggNOG KOG multifasta protein data. Reciprocal best hits between both of them were processed using internal scripts developed in both Ruby and Unix/POSIX shell script languages, so that new OGs were created and arranged, generating the n-OD conveniently named "KO + EggNOG KOG".
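The reciprocal best hits step can be illustrated with a short Python sketch. It assumes that each search direction has already been reduced to (query, target, score) tuples in which a higher score is better; the function names and the toy identifiers are hypothetical and do not come from the OrthoSearch code base.

```python
def best_hits(hits):
    """Map each query to its highest-scoring target.

    `hits` is an iterable of (query, target, score) tuples. On equal
    scores the first hit seen is kept.
    """
    best = {}
    for query, target, score in hits:
        if query not in best or score > best[query][1]:
            best[query] = (target, score)
    return {query: target for query, (target, _) in best.items()}


def reciprocal_best_hits(forward_hits, reverse_hits):
    """Return the sorted (query, target) pairs that are best hits both ways."""
    forward = best_hits(forward_hits)
    reverse = best_hits(reverse_hits)
    return sorted(
        (query, target)
        for query, target in forward.items()
        if reverse.get(target) == query
    )


# Toy example: representative proteins searched against fixed-database
# groups (forward) and the groups searched back against them (reverse).
forward_hits = [("kogA", "K1", 50.0), ("kogA", "K2", 80.0), ("kogB", "K2", 30.0)]
reverse_hits = [("K2", "kogA", 75.0), ("K2", "kogB", 20.0), ("K1", "kogB", 10.0)]
pairs = reciprocal_best_hits(forward_hits, reverse_hits)
```

In the toy data only `kogA` and `K2` select each other, so `kogB` is left without a reciprocal best hit even though its own best hit is also `K2`.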
During such n-OD creation, there were three possible scenarios relating the OGs of the fixed database (KO) to the impersonated proteome. Briefly, (i) those OGs from the KO database which did not have any reciprocal best hit with the impersonated OGs from EggNOG KOG and (ii) those OGs from EggNOG KOG that did not have any reciprocal best hit against the KO database were identified, selected and incorporated into our n-OD ("KO + EggNOG KOG") without any changes; (iii) those OGs from the KO database which presented a reciprocal best hit with the impersonated OGs from EggNOG KOG were expanded, by adding all proteins of the respective EggNOG KOG OG to such KO group.
Therefore, at the end of this run, we had a n-OD called "KO + EggNOG KOG" which comprised original, unaltered KO and EggNOG KOG OGs, as well as expanded KO OGs, which now also contain EggNOG KOG protein data. Once such n-OD was created, we performed the same steps described above, confronting "KO + EggNOG KOG" against ProtozoaDB impersonated OGs, which generated our second n-OD called "KO + EggNOG KOG + ProtozoaDB".
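The three scenarios described above can be summarised in a short Python sketch. It assumes that both databases are available as mappings from group identifier to a set of protein identifiers, that identifiers do not collide across databases, and that the reciprocal best hits have already been resolved to (fixed group, confronted group) pairs; `build_new_database` and the toy identifiers are hypothetical.

```python
def build_new_database(fixed_groups, confronted_groups, rbh_pairs):
    """Combine two orthologous databases into a new one.

    fixed_groups / confronted_groups: dict of group_id -> set of protein ids.
    rbh_pairs: iterable of (fixed_group_id, confronted_group_id) pairs.
    """
    # Scenario (i): fixed groups are carried over; unmatched ones stay intact.
    new_database = {gid: set(members) for gid, members in fixed_groups.items()}

    # Scenario (iii): a fixed group with a reciprocal best hit is expanded
    # with every protein of the matching confronted group.
    matched = set()
    for fixed_id, confronted_id in rbh_pairs:
        new_database[fixed_id] |= confronted_groups[confronted_id]
        matched.add(confronted_id)

    # Scenario (ii): confronted groups without a reciprocal best hit are
    # added unchanged, keeping their original identifiers.
    for gid, members in confronted_groups.items():
        if gid not in matched:
            new_database[gid] = set(members)
    return new_database


# Toy data with made-up identifiers.
ko = {"K1": {"a", "b"}, "K2": {"c"}}
kog = {"KOG1": {"x"}, "KOG2": {"y", "z"}}
merged = build_new_database(ko, kog, [("K1", "KOG2")])
```

Because the function returns a mapping of the same shape as its inputs, the second n-OD could be obtained under the same assumptions by calling it again with `merged` as the fixed database.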

Inferring protozoan orthologs with OrthoSearch and the n-ODs
Having built the two n-ODs, "KO + EggNOG KOG" and "KO + EggNOG KOG + ProtozoaDB", we ran OrthoSearch in a standard orthology inference against the three above-mentioned protozoan species.

Comparison between the n-ODs and OrthoMCLDB
The three protozoan species were confronted against OrthoMCLDB through online phyletic pattern search queries, in order to infer their orthologous proteins, in the same way as we did with the n-ODs created by our proposed methodology.
Quantitative results obtained while executing both OrthoSearch (with the two n-ODs) and OrthoMCLDB against the three protozoan species were compared in order to offer a better understanding of the proposed methodology behavior and to analyze if we were able to provide better results or not.

Potential Leishmania spp. targets against the human proteome
In order to identify potential protozoan targets that are absent from the human genome, a BlastP [47] search was performed between the largest n-OD generated by our methodology, "KO + EggNOG KOG + ProtozoaDB" (details on the orthologous groups' proteins are available in Additional file 3), and the human proteome, downloaded from RefSeq [48]. We used BlastP 2.2.28+ with an e-value threshold of 0.1, then extracted and analyzed the orthologous groups which produced no hit against the human proteome but contained Leishmania spp. sequences and therefore could represent potential targets. A BlastP search was also performed against the KO, EggNOG KOG and ProtozoaDB orthologous databases separately.
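The filtering step can be sketched in Python as below. The sketch assumes that the group proteins were used as queries against the human proteome and that the results were saved in the standard 12-column BLAST tabular layout (query identifier in the first column, e-value in the eleventh); the function names, the protein-to-species mapping and the toy identifiers are hypothetical.

```python
def queries_with_hits(tabular_lines, max_evalue=0.1):
    """Collect query identifiers with at least one hit at or below max_evalue.

    Assumes tab-separated lines: column 1 is the query id, column 11 the e-value.
    """
    hit_queries = set()
    for line in tabular_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        if float(fields[10]) <= max_evalue:
            hit_queries.add(fields[0])
    return hit_queries


def candidate_target_groups(groups, protein_species, tabular_lines,
                            genus="Leishmania", max_evalue=0.1):
    """Return groups with no human hit and at least one protein from `genus`."""
    hit_queries = queries_with_hits(tabular_lines, max_evalue)
    candidates = []
    for group_id, members in groups.items():
        if members & hit_queries:
            continue  # some member is similar to a human protein
        if any(protein_species.get(p, "").startswith(genus) for p in members):
            candidates.append(group_id)
    return sorted(candidates)


# Toy data with made-up identifiers.
groups = {"OG1": {"a", "b"}, "OG2": {"c"}, "OG3": {"d"}}
protein_species = {"a": "Leishmania infantum", "b": "Trypanosoma cruzi",
                   "c": "Leishmania major", "d": "Entamoeba histolytica"}
blast_lines = [
    "c\thuman_prot_1\t45.0\t100\t55\t0\t1\t100\t1\t100\t1e-20\t90.0",
    "a\thuman_prot_2\t25.0\t30\t22\t1\t1\t30\t5\t34\t5.0\t20.1",
]
candidates = candidate_target_groups(groups, protein_species, blast_lines)
```

In the toy data `OG2` is discarded because one of its proteins hits a human sequence, `OG3` is discarded because it has no Leishmania member, and the hit for protein `a` is ignored because its e-value exceeds the threshold.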

OrthoSearch for protozoa orthology inference
The protein data of the three protozoan species were confronted against (i) KO, (ii) EggNOG KOG and (iii) ProtozoaDB ODs (Fig. 2). ProtozoaDB performed best. With such data, we extracted coverage percentages, which relate the total number of OGs inferred by OrthoSearch to the number of OGs contained within each OD. For Cryptosporidium hominis, which has the smallest number of proteins of the three protozoan species studied, EggNOG KOG performed best, with 33 % coverage. Entamoeba histolytica also performed well with EggNOG KOG (37 %), but showed very similar results with ProtozoaDB (36 %), while showing poor coverage with KO (13 %). Finally, Leishmania infantum had the best coverage (47 %), with EggNOG KOG.
Internal scripts, developed with the R language and its Venn Diagram library, processed reciprocal best hits for such protozoan species. We identified species-specific, pair-to-pair and core OGs, depicted at Fig. 3.
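The partition behind such a Venn diagram can be reproduced with plain set operations. The following Python sketch is an illustration only (the analysis itself relied on R scripts); it assumes that the OG identifiers inferred for each species are available as sets, and `venn_regions` and the toy identifiers are hypothetical.

```python
from collections import defaultdict


def venn_regions(groups_by_species):
    """Partition group ids by the exact set of species in which they occur.

    Returns a dict mapping frozenset(species names) -> set of group ids.
    A key with one species is species-specific, a key with two species is
    pair-to-pair, and the key with all species is the core.
    """
    membership = defaultdict(set)
    for species, group_ids in groups_by_species.items():
        for group_id in group_ids:
            membership[group_id].add(species)

    regions = defaultdict(set)
    for group_id, species_set in membership.items():
        regions[frozenset(species_set)].add(group_id)
    return dict(regions)


# Toy data with made-up group identifiers.
toy = {
    "C. hominis": {"OG1", "OG2", "OG4"},
    "E. histolytica": {"OG1", "OG3", "OG4"},
    "L. infantum": {"OG1", "OG5"},
}
regions = venn_regions(toy)
core = regions[frozenset(toy)]
```

Every group identifier lands in exactly one region, so the region sizes add up to the number of distinct groups inferred across the three species.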
In addition, ProtozoaDB presented the best species-specific results, with Entamoeba histolytica accounting for 40.77 % of the total OGs (4,086/9,979); Leishmania infantum for 27.93 % (2,787/9,979); and Cryptosporidium hominis for 19.68 % (1,964/9,979). Table 1 shows details on how many OGs remained intact and directly migrated to the n-ODs created by our methodology, as well as those that were expanded.

OrthoSearch for Orthologous Database building
After "KO + EggNOG KOG" building, we had a 14.02 % increase in the total number of OGs when compared to KO (16,   shows protozoan species representation at each created n-OD.

Inferring protozoan orthologs with OrthoSearch and the n-ODs
With these recently created n-ODs, in our second scenario we executed OrthoSearch using as input such n-ODs and the same three protozoan species, then compared the obtained results against the previous KO analysis. Figure 4 depicts coverage percentage data for each of the OG databases created by the methodology, for each organism adopted. Our methodology provided an 86.47 % increase in the total number of OGs and a 22.45 % increase in the total number of proteins when comparing "KO + EggNOG KOG + ProtozoaDB" against KO. Although there was a relevant increase in the number of inferred OGs for Cryptosporidium hominis (from 1,499 up to 3,254 groups), the coverage increase was very subtle (from 10 % to 12 %). Entamoeba histolytica, on the other hand, shows a relevant increase in coverage (up to 19 %), especially when confronted against the "KO + EggNOG KOG + ProtozoaDB" n-OD. Leishmania infantum behaved very similarly to Cryptosporidium hominis, with a total of up to 4,760 OGs and coverage rising from 15 % to 17 %.
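As a worked example, the coverage values for the two n-ODs can be recomputed from the group counts reported in the abstract. The sketch assumes that coverage is the number of inferred OGs divided by the number of OGs in the database; under that assumption the rounded values match the reported percentages. The names used are hypothetical.

```python
def coverage_percent(inferred_groups, database_groups):
    """Coverage as the percentage of database groups inferred for a species."""
    return 100.0 * inferred_groups / database_groups


# Group counts as reported in the abstract of this article.
DATABASE_SIZES = {
    "KO + EggNOG KOG": 16938,
    "KO + EggNOG KOG + ProtozoaDB": 27701,
}
INFERRED_GROUPS = {
    "Cryptosporidium hominis": {"KO + EggNOG KOG": 1821,
                                "KO + EggNOG KOG + ProtozoaDB": 3254},
    "Entamoeba histolytica": {"KO + EggNOG KOG": 2245,
                              "KO + EggNOG KOG + ProtozoaDB": 5305},
    "Leishmania infantum": {"KO + EggNOG KOG": 2702,
                            "KO + EggNOG KOG + ProtozoaDB": 4760},
}

# Rounded coverage per species and database.
coverage_table = {
    species: {
        database: round(coverage_percent(count, DATABASE_SIZES[database]))
        for database, count in per_database.items()
    }
    for species, per_database in INFERRED_GROUPS.items()
}
```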
We also consolidated the reciprocal best hits obtained in Venn diagrams, so that we might have a glimpse on species-specific, pair-to-pair and core OGs, as shown in Fig. 5.
Concerning OGs respectively. Figure 6 shows a Venn diagram with obtained results.

Potential Leishmania spp. targets against the human genome
A BlastP search against our largest created n-OD, "KO + EggNOG KOG + ProtozoaDB" (27,701 orthologous groups), allowed us to infer 7,622 (27.5 %) orthologous groups which did not produce any hit against the human proteome. Among these, 1,805 groups (6.5 % of 27,701) belong to KO or EggNOG KOG, but are not available in ProtozoaDB, which contains only protozoan organisms (Leishmania spp. included). Furthermore, 13 orthologous groups (0.05 %) contain at least one Leishmania spp. sequence (Table 4) and should be considered as potential targets for further analysis.
The same BlastP query against each of the original ODs provided us the results listed in Table 5. These groups have no similarity with the human proteome and have at least one Leishmania spp. sequence.

Discussion
In this analysis, we adopted new programming languages and updated the OrthoSearch pipeline with several bioinformatics tools, rewriting it to be later used in homology inference analyses and n-ODs creation.
OrthoSearch uses an algorithm based on reciprocal best hits calculation via HMM profiles, with Mafft being
OrthoSearch execution with KO OD provides a significantly small core compared to KO size and the total number of best hits. That could be explained by the fact that KO contains proteins from many evolutionarily distant organisms, which could pose a challenge in the identification of closely related OGs.
Later, EggNOG KOG OD provided a discrete increase in the obtained protein core, most likely due to EggNOG KOG having only eukaryotic organisms' data. While ProtozoaDB OD provided the smallest core among the three ODs, its total number of species-specific proteins is considerably higher. This could be due to the reduced number of species in ProtozoaDB OD, along with the fact that all of those are protozoan organisms. Basically, the odds of obtaining a hit with a protein belonging to the very species being analyzed within OrthoSearch could increase.
We opted to choose a representative protein for each OG at the confronted OD and impersonate an organism multifasta protein data because that could minimize the required computational power and time needed to run OrthoSearch analyses.
Since an OD contains several OGs, each of which also contains several proteins, confronting complete ODs would easily escalate the required time. In addition, as each OG contains two or more proteins, usually from closely related organisms, two (or more) distinct proteins from the same OG could obtain a hit with distinct OGs at the confronted OD.
Our scenarios for n-OD creation were based on KO, EggNOG KOG and ProtozoaDB ODs. According to the literature, each of these ODs was created through a particular methodology: the use of metabolic pathways (KO), heuristic approaches and Gene Ontology [49] support (EggNOG KOG) and the OrthoMCL algorithm (ProtozoaDB).
Our methodology allowed us to create n-ODs that contain either intact OGs, which originated from the source or the confronted ODs, or expanded OGs, built from the reciprocal best hits inferred by OrthoSearch. The intact OGs contribute by offering more OGs for further analyses, while the expanded ones provide more variability than those OGs from the original databases.
Fig. 4 OrthoSearch inferred orthologous groups and coverage, per organism, with the databases created by the methodology itself. A detailed view of how many orthologous groups were inferred with (i) "KO + EggNOG KOG" and (ii) "KO + EggNOG KOG + ProtozoaDB" databases and what such numbers represent against the organisms' total protein numbers
Besides providing a means to improve ProtozoaDB orthology inference, we opted to begin our n-OD creation tasks with KO and EggNOG due to the variability (proteins from organisms from all life domains) and size of both databases. We also decided to maintain the original orthologous groups database identifiers, as well as their functional annotation. That might ease further steps related to information provenance.
Our proposed methodology works as a non-intrusive approach to an HMM-based pipeline, OrthoSearch, without changing its core functions. It uses ODs as input data and is capable of creating n-ODs without requiring extensive computational power.
When looking at how protozoan species data fit our proposed n-OD creation methodology, we observe a 61.98 % increase from KO to "KO + EggNOG KOG" OD, scaling from 3,612 up to 5,851 OGs with at least one Protozoa protein. In addition, the total number of protozoan proteins had a 62 % increase (from 46,027 up to 74,630) in such n-OD. Furthermore, when KO and "KO + EggNOG KOG + ProtozoaDB" ODs were compared, we identified a 379 % increase in the number of OGs that contain at least one Protozoa protein (from 3,612 up to 17,305) and a 202 % increase in the total number of protozoan proteins (from 46,027 up to 138,814).
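The relative increase between two counts is (new - old) / old, expressed as a percentage. The short Python sketch below recomputes it from the counts quoted in this paragraph; the function name and the dictionary labels are hypothetical.

```python
def percent_increase(old, new):
    """Relative increase from `old` to `new`, as a percentage of `old`."""
    if old <= 0:
        raise ValueError("old count must be positive")
    return 100.0 * (new - old) / old


# (old, new) counts quoted in the text, with KO as the baseline.
COUNTS = {
    "OGs with protozoan proteins, KO + EggNOG KOG": (3612, 5851),
    "protozoan proteins, KO + EggNOG KOG": (46027, 74630),
    "OGs with protozoan proteins, KO + EggNOG KOG + ProtozoaDB": (3612, 17305),
    "protozoan proteins, KO + EggNOG KOG + ProtozoaDB": (46027, 138814),
}

increases = {label: percent_increase(old, new) for label, (old, new) in COUNTS.items()}
```

Note that an increase expressed this way differs from the ratio new / old: a count that triples corresponds to a 200 % increase.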
A broader dataset is usually desirable, as it may increase the odds of obtaining hits while inferring homology. As more organisms contribute to a n-OD, one might be able to obtain more hits with regular OrthoSearch runs confronting n-ODs and organisms multifasta protein data.
OrthoSearch uses a hidden Markov model (HMM) based approach to create the n-OD OGs, which tend to comprise more evolutionarily distant orthologous proteins than those from BLAST-based methodologies, such as OrthoMCLDB.
With the n-ODs created within our methodology, our initiative is another step to reinforce possibilities to build a gold, reference dataset [50] for orthology inference.
As more OGs (and their respective proteins) from distinct species are added to the created n-ODs, the methodology offers a broader dataset, with more data variability. Such data may be used for further homology inference analyses, which is a very desirable aspect in several comparative genomics applications. Examples include phylogenomic studies which try to address gene conflicts and allow for optimal tree construction [51], or even to review the definition of species [52].
The obtained results point towards the success of our proposed methodology, which encourages us to refine and create more n-ODs. Such n-ODs might also be used to improve future functional annotation, re-annotation and potential target identification.
When looking at pairwise groups, there is a change in this scenario. Our methodology either provides the same quantitative results as OrthoMCLDB (Cryptosporidium hominis and Entamoeba histolytica -162 OGs) or better, with 17.88 % more OGs (Entamoeba histolytica and Leishmania infantum -435/369) and up to 48.81 % more OGs (Cryptosporidium hominis and Leishmania infantum -378/254).
OrthoMCLDB inferred a larger absolute number of OGs in the core of the three protozoan species (760) than our methodology (627), which may be related to a broader seed of OGs (124,740/27,701). On the other hand, OrthoMCLDB performed poorly in terms of coverage, with only 0.61 % of its OGs in the core (760/124,740), while our methodology covered 2.26 % of the OGs (627/27,701) with our largest n-OD ("KO + EggNOG KOG + ProtozoaDB"). This may be due to the fact that the studied species are not so closely related. In addition, OrthoMCLDB uses a Blast-based algorithm, a less sensitive approach than our methodology (OrthoSearch uses a protein-profile comparison) [9,32].
Our methodology provides means for improved orthologous database creation using a HMM-based approach. Those new databases may contain a greater set of evolutionary distant homologous proteins, which could further extend the odds of inferring knowledge regarding the target organisms.
Specifically, our analyses allowed for a better comprehension on three protozoan species, as well as a deeper analysis on potential targets. For example, the obtained protozoan core orthologous proteins may allow us to evaluate which of these are housekeeping proteins and how they relate to the organism fitness.
Also, the species-specific proteins (those which do not belong to the core) or those shared between two of the three studied protozoan organisms might be explored either as species-specific or group-specific targets, respectively.
The obtained BlastP results allowed us to infer orthologous groups which contain protozoan proteins, specifically from Leishmania spp., that could be used as potential targets for further analysis, as they produced no hit against the human proteome.
Among the Leishmania spp. inferred orthologous groups without hits against the human proteome (Table 4) are proteins already described in the literature as possible drug targets, briefly: trypanothione [53] (K01833.cdhit), which relates to defense against oxidative stress [54]; and alpha-1,3-mannosyltransferase [55,56] (K13690.cdhit), an enzyme essential to add mannose on the glycosylphosphatidyl, related to the growing resistance to miltefosine. However, there are also other proteins, not yet described as drug targets, which should be further studied, briefly: the energy-converting hydrogenase B subunit I [55] (K14118.cdhit), found in the archaeon Methanothermobacter thermautotrophicus, which belongs to a domain related to the MnhB subunit of the Na+/H+ antiporter and is predicted as an integral membrane protein [57,58]; and galactofuranosyltransferase [59] (K13672.cdhit), related to the LPG1 gene, which acts as a major ligand for macrophage adhesion [60,61]. Our methodology also provided means to allocate new and evolutionarily distant proteins to the original orthologous groups' databases, identifying orthology relationships which have not been previously described.
Even though this is a preliminary analysis, it allowed us to evaluate the applied methodology and to forecast how its results may be used for protozoan target identification, from either a species-specific or a shared point of view. This methodology will later be applied to all 22 ProtozoaDB [46] protozoan organisms.