Knowledge of the transcriptomes and proteomes of different developmental stages of a parasite, its vector and its definitive host is central to gaining an enhanced understanding of the molecular mechanisms that govern essential biological, infection and disease processes and, ultimately, could assist in identifying possible avenues for the development of novel intervention strategies. In the absence of information for the organism under study, accurate bioinformatic analyses of nucleic acid and protein sequence data (often by comparison with or inference from reference organisms) are crucial for providing biologically meaningful molecular information about CVBDs. Until recently, detailed bioinformatic analyses of such datasets have been restricted largely to specialized laboratories with substantial computer and software capacities. The development of flexible and practical bioinformatic workflow systems is beginning to provide scientists with user-friendly tools for the analysis of massive datasets.
Currently, due to a lack of complete genomic sequences for many pathogens and vectors (and different strains thereof) associated with CVBDs, newly generated sequence datasets need to be assembled de novo, which means that pooled reads are assembled without a bias towards known sequences. Due to the amount of RNA required for NGS (~5-10 μg), transcriptomes usually originate from numerous individuals, potentially leading to an increased complexity of the sequence data acquired (linked, for instance, to a biased nucleotide content, single nucleotide polymorphisms [SNPs] and other types of sequence variation) and sometimes posing challenges for data assembly. In terms of computational and time requirements, de novo assemblies are much slower and more memory-intensive than knowledge-based (mapping) assemblies, in which reads are aligned and assembled against an existing reference sequence (representing the same species or genetic variant). In addition, reliable de novo assemblies are highly dependent upon the availability of long reads (>100 bases) and of high-coverage, paired-end sequence data. In previous studies, the complementary nature of the 454 and Illumina sequencing platforms has allowed the assembly of raw reads into large scaffolds without the need for a reference sequence [76–78].
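To make the idea of assembling reads "without a bias towards known sequences" concrete, the following minimal sketch merges reads purely on the basis of their mutual suffix-prefix overlaps. It is a toy greedy overlap approach written for illustration only; it is not the method used by any of the assemblers or studies cited here, it ignores sequencing errors, reverse complements and paired-end information, and the function names and the `min_overlap` parameter are assumptions introduced for this example.

```python
from __future__ import annotations


def overlap(a: str, b: str, min_len: int) -> int:
    """Length of the longest suffix of `a` that equals a prefix of `b`.

    Returns 0 if no overlap of at least `min_len` bases exists.
    """
    max_len = min(len(a), len(b))
    for k in range(max_len, min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_assemble(reads: list[str], min_overlap: int = 3) -> list[str]:
    """Repeatedly merge the pair of sequences with the longest overlap.

    No reference sequence is consulted at any point (toy de novo assembly).
    """
    contigs = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while True:
        best_len, best_i, best_j = 0, -1, -1
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    k = overlap(a, b, min_overlap)
                    if k > best_len:
                        best_len, best_i, best_j = k, i, j
        if best_len == 0:
            return contigs  # nothing left to merge
        merged = contigs[best_i] + contigs[best_j][best_len:]
        contigs = [c for idx, c in enumerate(contigs)
                   if idx not in (best_i, best_j)]
        contigs.append(merged)


if __name__ == "__main__":
    # Made-up reads, not real sequence data.
    print(greedy_assemble(["ATGGCG", "GCGTAC", "TACGGA"]))
```

Every pair of sequences is compared in every round, so the cost grows quickly with the number of reads. This gives a rough sense of why de novo assembly is more demanding than aligning each read to an existing reference.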
In the absence of reference genomes for agents and vectors linked to CVBDs, accurate assembly of sequence data is a crucial step in examining coding genes and, ultimately, addressing biological questions regarding gene and protein functions. Functions are initially predicted by 'sequence annotation' (i.e., the process of gathering all available information and relating it to the sequence assembly by both experimental and computational means). Accurate annotation depends on how efficiently the underlying resources are updated and curated. Presently, open-source programs and databases routinely employed for the bioinformatic analyses of sequence data are available via multiple portals, so that significant effort is required to maintain accurate and up-to-date assembly and annotation pipelines. In addition, the rate at which public databases are updated and corrected varies considerably. For instance, the Swiss-Prot database (http://au.expasy.org/sprot/) accepts corrections from its user community, whereas GenBank (http://www.ncbi.nlm.nih.gov/genbank/) only accepts corrections from the author of an entry, which significantly affects the accuracy and speed with which new sequences are annotated. Moreover, some information-management systems incorporate data from large-scale projects, but the annotation of single records from the literature is often slow. Given that, presently, the annotation of sequence data for parasites and vectors relies heavily on the use of bioinformatic approaches and already annotated/curated sequence data for a wide range of organisms, these aspects deserve careful consideration.
The analysis and annotation of large-scale transcriptomic, proteomic and genomic sequence datasets for pathogens could be facilitated through the establishment of a 'reference' website for CVBDs. Such a website could provide regular releases of newly developed and validated bioinformatic pipelines for the analysis of sequence datasets. It could also provide links to regularly updated databases that are routinely employed for the annotation of new sequences as well as a distinct, high-quality database of curated functional annotations, supported by experimental data published in peer-reviewed, international publications. In the future, the establishment of a 'centralized' resource to enable the sharing and optimization of bioinformatic pipelines for sequence processing and annotation and, more broadly, to allow access to new sequence data, experimental protocols and relevant literature would be advantageous.
The annotation of peptides inferred from a dataset is conducted by assigning predicted biological function(s) based on comparison with existing information available for related organisms in public databases, including InterPro (http://www.ebi.ac.uk/interpro/), Gene Ontology (http://www.geneontology.org/), OrthoMCL (http://www.orthomcl.org/) and BRENDA (http://www.brenda-enzymes.org/). Using this approach, predictions can be made for key groups of molecules regarding their fundamental and essential roles in biological processes. Such groups include molecules linked to the physiology of the nervous system, the formation of the cuticle (arthropods and nematodes) [37, 83], reproduction, development, signal transduction and/or pathogen invasion and disease processes (e.g., proteases and protease inhibitors, protein kinases and phosphatases) [36–39, 84].
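The principle of assigning a predicted function by comparison with already annotated sequences can be sketched as follows. This is only an illustration under stated assumptions: the `Hit` record, its field names, the example identifiers and the E-value cut-off of 1e-5 are all hypothetical and do not correspond to the output format or defaults of any particular database or search tool named above.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    """One similarity-search hit (hypothetical record layout)."""
    query: str       # identifier of the inferred peptide
    subject: str     # identifier of the annotated reference protein
    evalue: float    # expectation value of the match (lower is better)
    annotation: str  # functional description of the reference protein


def transfer_annotations(hits: list[Hit],
                         max_evalue: float = 1e-5) -> dict[str, str]:
    """Assign to each query the annotation of its best-scoring hit.

    Hits with an E-value above `max_evalue` (an assumed cut-off) are ignored,
    so weakly supported queries simply remain unannotated.
    """
    best: dict[str, Hit] = {}
    for hit in hits:
        if hit.evalue > max_evalue:
            continue
        current = best.get(hit.query)
        if current is None or hit.evalue < current.evalue:
            best[hit.query] = hit
    return {query: hit.annotation for query, hit in best.items()}


if __name__ == "__main__":
    # Invented example data, for illustration only.
    example = [
        Hit("pep1", "refA", 1e-40, "serine protease"),
        Hit("pep1", "refB", 1e-12, "protease inhibitor"),
        Hit("pep2", "refC", 0.5, "protein kinase"),
    ]
    print(transfer_annotations(example))
```

Because each annotation is inherited from a single reference entry, any error or outdated description in that entry is carried over unchanged, which is why the curation issues discussed above matter in practice.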
The bioinformatic prediction and prioritization of novel drug targets involves 'filtering' [85, 86] and usually includes inferring targets based on key principles and requirements [87–91]. First, target proteins should have one or more essential roles in fundamental biological processes of the pathogen and/or vector, such that disruption of the molecule or its gene will damage and/or kill either or both, and thus interrupt disease transmission or the disease itself, without affecting the host [90, 92]. In the absence of phenotypic data for many pathogens/vectors, the prediction of drug target candidates in eukaryotic pathogens/vectors can be assisted by using extensive information on function and essentiality in a range of eukaryotic organisms, including S. cerevisiae, D. melanogaster, C. elegans and M. musculus. This information can be accessed via public databases, including FlyBase (http://flybase.org/), WormBase (http://www.wormbase.org), Mouse Genome Informatics (http://www.informatics.jax.org/) and the Saccharomyces Genome Database (http://www.yeastgenome.org/) [39, 89, 93–95]. Second, since most effective drugs achieve their activity by competing with endogenous small molecules for a binding site on a target protein, the amino acid sequences predicted from essential genes should be screened for the presence of relatively conserved ligand-binding domains [96, 97]. Lists of inhibitors known, based on experimental evidence, to bind specifically to such domains can then be compiled. However, the predictions made are intended to support hypothesis-driven or applied research and thus require extensive experimental investigation. The main advantage of a number of CVBD pathogens (e.g., Babesia and Leishmania) over, for example, some parasitic helminths is that they can be propagated readily in vitro (e.g., [98, 99]). This provides unique prospects to test gene function(s) by double-stranded RNA interference, transgenesis and/or deletion studies as well as by using small molecular inhibitors (cf. [100–102]).
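The 'filtering' logic described in this paragraph can be sketched as a sequence of simple criteria applied to each candidate protein. The sketch below is a hypothetical illustration, not a published pipeline: the `Candidate` fields, the use of sequence identity to host proteins as a proxy for "not affecting the host", and the 40% identity threshold are all assumptions made for the example.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Candidate:
    """A predicted protein with the evidence used for filtering (hypothetical)."""
    protein_id: str
    essential_ortholog: bool  # ortholog is essential in a model organism
    host_identity: float      # best % identity to any host protein (0-100)
    ligand_domains: list[str] = field(default_factory=list)


def prioritize(candidates: list[Candidate],
               max_host_identity: float = 40.0) -> list[Candidate]:
    """Keep candidates passing all three filters, least host-like first.

    Filters: (1) an essential ortholog exists, (2) at least one conserved
    ligand-binding domain was predicted, (3) similarity to host proteins
    does not exceed `max_host_identity` (an assumed threshold).
    """
    kept = [
        c for c in candidates
        if c.essential_ortholog
        and c.ligand_domains
        and c.host_identity <= max_host_identity
    ]
    return sorted(kept, key=lambda c: c.host_identity)


if __name__ == "__main__":
    # Invented candidates, for illustration only.
    pool = [
        Candidate("p1", True, 35.0, ["protein kinase domain"]),
        Candidate("p2", True, 80.0, ["GTP-binding domain"]),
        Candidate("p3", False, 10.0, ["protein kinase domain"]),
        Candidate("p4", True, 20.0, []),
        Candidate("p5", True, 15.0, ["phosphatase domain"]),
    ]
    print([c.protein_id for c in prioritize(pool)])
```

A ranked list of this kind is only a starting point for hypothesis-driven work; each retained candidate would still need to be tested experimentally.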
Based on recent evidence [103–105], guanosine triphosphatases (GTPases), protein phosphatases and protein kinases seem to represent attractive drug target candidates for a range of pathogens, but have not yet been examined on a genome-wide scale and in a systematic manner for most CVBDs. Numerous cellular signaling pathways function through the activity of small GTP-binding proteins to regulate a range of biological processes, such as transmembrane signal transduction, cytoskeletal reorganization, gene expression, intracellular vesicle trafficking, microtubule organization and nucleocytoplasmic transport. Small GTPases are monomeric proteins (~20-28 kDa) belonging to six families (i.e., Ras, Rho, Rab, Arf, Ran and Rad). These regulatory proteins act as molecular switches that cycle between two conformational states (i.e., GDP-bound ["inactive" state] and GTP-bound ["active" state]) and hydrolyze GTP. In humans, the aberrant regulation of GTPases is linked to a number of dysfunctions, including neurological and developmental disorders and cancer. In addition, intracellular pathogenic bacteria, such as Mycobacterium tuberculosis, are known to target host GTPases to evade host immune responses and facilitate the infection process. Such information has stimulated efforts to develop novel therapeutic strategies to inhibit the function of GTPases. For instance, treatments with farnesyltransferase inhibitors, which block the oncogenic properties of Ras GTPases, have been shown to significantly reduce the progression of various forms of cancer, including carcinomas of the colon, pancreas and lung, neurofibrosarcoma and chronic myelogenous leukemia, in experimental animals [110, 111], as well as the migration and cytoskeletal organization of human prostate cancer cells.
Although the overall structure of individual small GTPases is conserved across eukaryotes, the filtering of datasets for the organism of interest (i.e., pathogen and/or vector) allows the identification of significant differences in the sequences of GTPases between the invertebrate and the definitive host. These differences might be considered in future studies aimed at assessing the possibility of designing and synthesizing selective and specific inhibitors against parasite GTPases. Homology modelling [113, 114], X-ray crystallography/nuclear magnetic resonance (NMR) and docking [115–120] studies should assist in this process.
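As a simple illustration of how sequence differences between a parasite protein and its host counterpart might be located, the sketch below compares two sequences that are assumed to have been aligned beforehand (equal length, with '-' marking gaps). The function name and the example strings are hypothetical; the strings are arbitrary and do not represent real parasite or host GTPases.

```python
from __future__ import annotations


def compare_aligned(parasite: str, host: str) -> tuple[float, list[tuple[int, str, str]]]:
    """Return percent identity and the list of differing alignment columns.

    Both inputs must come from the same pairwise alignment. Columns in which
    both sequences carry a gap are skipped; positions are 1-based.
    """
    if len(parasite) != len(host):
        raise ValueError("aligned sequences must have the same length")
    compared = 0
    identical = 0
    differences: list[tuple[int, str, str]] = []
    for pos, (p, h) in enumerate(zip(parasite, host), start=1):
        if p == "-" and h == "-":
            continue
        compared += 1
        if p == h:
            identical += 1
        else:
            differences.append((pos, p, h))
    identity = 100.0 * identical / compared if compared else 0.0
    return identity, differences


if __name__ == "__main__":
    identity, diffs = compare_aligned("MTEYKLVVVG", "MTEYKLVVLG")
    print(identity, diffs)
```

Columns that differ within or near a predicted ligand-binding region would be the natural candidates to examine further by homology modelling or docking.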
Selected protein kinases (PKs) are also potential drug targets for a range of pathogens. PKs belong to a large family of proteins regulating development, cell division, differentiation and metabolism in many organisms; these molecules are considered the second most important group of drug targets after G protein-coupled receptors (GPCRs) [121, 122]. The family of PKs comprises cell surface receptors and non-receptor or cytosolic kinases. Integrated genomic-bioinformatic-chemoinformatic approaches have been employed for the identification and screening of effective PK inhibitors as therapeutic agents [123–125]. For example, in studies aimed at identifying novel inhibitors of a human tyrosine kinase involved in the development and progression of chronic myelogenous leukemia, 15 compounds were selected following in silico screening of a database of 200,000 known inhibitors. Of these compounds, eight were shown to selectively inhibit the growth of leukemia cells in vitro. In another study, novel and selective inhibitors of casein kinase II (CK2) were identified via in silico screening of a database containing ~400,000 compounds, followed by in silico docking. These examples indicate the advantages of using computer-aided tools for the rational prediction and design of drugs for subsequent in vitro and in vivo efficacy testing. Nonetheless, it is clear that any compound shown to be efficacious must also be rigorously tested for its safety (see http://www.ich.org/cache/compo/276-254-1.html).
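The screening methods used in the studies cited above are not detailed here. As one generic illustration of how a large compound library can be narrowed down computationally, the sketch below ranks library compounds by their Tanimoto similarity to a known inhibitor, with each compound represented as a set of structural feature indices (a binary fingerprint). This is a ligand-based similarity search, not the docking procedure mentioned above, and the compound names and fingerprints are invented for the example.

```python
from __future__ import annotations


def tanimoto(a: frozenset[int], b: frozenset[int]) -> float:
    """Tanimoto coefficient of two binary fingerprints given as sets of 'on' bits."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def rank_by_similarity(query: frozenset[int],
                       library: dict[str, frozenset[int]],
                       top_n: int = 3) -> list[tuple[str, float]]:
    """Return the `top_n` library compounds most similar to the query."""
    scored = [(name, tanimoto(query, fp)) for name, fp in library.items()]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:top_n]


if __name__ == "__main__":
    # Invented fingerprints; real ones would be computed by a cheminformatics toolkit.
    known_inhibitor = frozenset({1, 2, 3, 4})
    library = {
        "cmpd_a": frozenset({1, 2, 3, 4}),
        "cmpd_b": frozenset({1, 2, 5}),
        "cmpd_c": frozenset({7, 8}),
    }
    print(rank_by_similarity(known_inhibitor, library, top_n=2))
```

Only the top-ranked compounds would then be carried forward to more expensive steps such as docking and in vitro testing.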
Because of the regulatory role that PKs play in a number of signaling pathways in the cell, interference with their activity can result in the disruption of fundamental homeostatic processes in parasites. In recent years, protein kinases have received particular attention as drug targets in protists, such as species of Plasmodium, Leishmania and Trypanosoma, and in helminths. For instance, particular pyrrole and imidazopyridine inhibitors of cyclic guanosine monophosphate-dependent protein kinases have been shown to severely impair the growth of the promastigote forms of Leishmania major in vitro. In some helminths, PK inhibitors (i.e., tyrphostins AG1024 and AG538) have been shown to significantly affect the survival and development of the adult parasite through the blockage of glucose uptake. The inactivation of PKs with herbimycin A has also been shown to interfere with mitosis, thus significantly affecting the expression of proteins essential for egg production in the worm. Although the crystal structures of PKs in many pathogens have not yet been defined, progress has been made in the identification and design of effective inhibitors based on homology models for protein kinases from humans. There is evidence that the active sites of parasite PKs display subtle differences compared with their human counterparts, which is considered promising for the development of parasite-specific kinase inhibitors. However, much more study is required to establish the potential of PK inhibitors against pathogens causing CVBDs. This is clearly a research area worth pursuing.