De novo transcriptomic analysis of the female and male adults of the blood fluke Schistosoma turkestanicum

Schistosoma turkestanicum is a parasite of considerable veterinary importance as an agent of animal schistosomiasis in many countries, including China. The S. turkestanicum cercariae can also infect humans, causing cercarial dermatitis in many countries and regions of the world. In spite of its significance as a pathogen of animals and humans, there is little transcriptomic and genomic data in the public databases. Herein, we performed the transcriptome Illumina RNA sequencing (RNA-seq) of adult males and females of S. turkestanicum and de novo transcriptome assembly. Approximately 81.1 (female) and 80.5 (male) million high-quality clean reads were obtained and then 29,526 (female) and 41,346 (male) unigenes were assembled. A total of 34,624 unigenes were produced from S. turkestanicum females and males, with an average length of 878 nucleotides (nt) and N50 of 1480 nt. Of these unigenes, 25,158 (72.7 %) were annotated by blast searches against the NCBI non-redundant protein database. Among these, 21,995 (63.5 %), 22,189 (64.1 %) and 13,754 (39.7 %) of the unigenes had significant similarity in the NCBI non-redundant protein (NR), non-redundant nucleotide (NT) and Swiss-Prot databases, respectively. In addition, 3150 unigenes were identified to be expressed specifically in females and 1014 unigenes were identified to be expressed specifically in males. Interestingly, several pathways associated with gonadal development and sex maintenance were found, including the Wnt signaling pathway (103; 2 %) and progesterone-mediated oocyte maturation (77; 1.5 %). The present study characterized and compared the transcriptomes of adult female and male blood fluke, S. turkestanicum. These results will not only serve as valuable resources for future functional genomics studies to understand the molecular aspects of S. turkestanicum, but also will provide essential information for ongoing whole genome sequencing efforts on this pathogenic blood fluke.


Background
Schistosomiasis is a neglected tropical disease caused by blood flukes (genus Schistosoma) that ranks second only to malaria among the parasitic diseases and affects more than 200 million people in a number of tropical and subtropical countries [1]. It is estimated that schistosomiasis causes an annual loss of between 1.7 and 4.5 million disability adjusted life years (DALYs) and contributes to several hundreds of thousands of deaths annually [2][3][4]. To date, no vaccines are available, and treatment relies on one drug, praziquantel [5].
S. turkestanicum (syn. Orientobilharzia turkestanicum) is a parasite of major veterinary importance as an agent of animal schistosomiasis, which infects a range of animals including cattle, sheep, goats, water buffaloes, horses, donkeys, mules, camels, and causes considerable economic losses [6]. Importantly, the cercariae of S. turkestanicum can also infect humans, in which they can cause cercarial dermatitis, and is considered the major pathogen of cercarial dermatitis in many countries and regions of the world, including China [6,7]. In spite of its medical and veterinary importance, the genetics, epidemiology and biology of this parasite remain poorly understood.
Recent developments in deep sequencing or nextgeneration sequencing (NGS) provide an opportunity for rapid and cost-effective generation of genome-scale data [8,9]. NGS technology, such as Solexa/Illumina platforms, has dramatically improved the efficiency of gene discovery, and is particularly suited to transcriptomics of distinct animal and plant taxa [10][11][12][13]. To date, many trematode transcriptomes have been sequenced using the NGS technology, such as Fascioloides magna [14]; Clonorchis sinensis [15]; Eurytrema pancreaticum [16], Schistosoma japonicum [17], Schistosoma mansoni [18], Schistosoma haematobium [19], Opisthorchis felineus [20], Paramphistomum cervi [21] and Fasciola hepatica [22]. In spite of the availability of advances in sequencing technologies and bioinformatic methods, there is still a paucity of knowledge of the complete transcriptome for this zoonotic fluke S. turkestanicum. Herein, a NGS platform and powerful de novo short-read assembly was employed to uncover a global view of the transcriptomes of adult male and female S. turkestanicum worms. The aim of the present study was to produce transcriptomic data to aid the better understanding of the biology of S. turkestanicum, which would facilitate the identification of intervention targets for S. turkestanicum and other medically and veterinary important trematodes.

Ethics statement
This study was approved by the Animal Ethics Committee of Lanzhou Veterinary Research Institute, Chinese Academy of Agricultural Sciences (Permit code. LVRI AEC2013-006). The sheep, from which S. turkestanicum was collected, were handled in accordance with good animal practices required by the Animal Ethics Procedures and Guidelines of the People's Republic of China.

Sample preparation
Adult blood flukes representing female and male S. turkestanicum (codes: OutF and OutM) were obtained from the portal and mesenteric veins of naturally infected sheep at a local slaughter house in Heilongjiang province, China. One blood fluke was collected from each of the five sheep, which were randomly selected for slaughter and sale, following the perfusion of the mesenteric and intestinal vessels using physiological saline (37°C). Five blood flukes (three females and two males, respectively) were washed in physiological saline six times to remove any contamination with bacterial and host DNA, identified morphologically as S. turkestanicum according to existing keys and descriptions [23], and their identity was further ascertained by direct amplification and sequencing of the internal transcribed spacer (ITS) region, as previously described [24]. The male and female S. turkestanicum worms were immediately frozen in liquid nitrogen and stored at −80°C until use.

RNA isolation and Illumina sequencing
Total RNA was extracted from pooled females (n = 3) and pooled males (n = 2) of S. turkestanicum, respectively, using Trizol reagent (Invitrogen, Life Technologies, Carlsbad, CA, USA) according to the manufacturer's protocol, and was stored at −80°C until use. The Oligo (dT) was used to isolate poly (A) mRNA from total RNA. Mixed with the fragmentation buffer, the mRNA was fragmented into short fragments. Then cDNA was synthesized using the mRNA fragments as templates. Short fragments were purified and resolved with EB buffer for end reparation and single nucleotide A (adenine) addition. After that, the short fragments were connected with adapters. The suitable fragments were selected as templates for the PCR amplification. During the Quality Control (QC) steps, an Agilent 2100 Bioanaylzer and an ABI StepOnePlus Real-Time PCR System were used in quantification and qualification of the sample library. Illumina HiSeq™ 2000 was applied for sequencing at BGI-Shenzhen, Shenzhen, China according to the manufacturer's instructions (Illumina, San Diego, CA, USA).

De novo assembly
Prior to assembly, the high-quality clean reads were obtained from raw reads by removing adaptor sequences, highly redundant sequences, reads that contained more than 10 % 'N' rate (the 'N' character representing ambiguous bases in reads), and low quality reads containing more than 50 % bases with Quality (Q) value (Q-value ≤ 10). De novo assembly of the clean reads was performed by using the Trinity software, which was designed specifically for transcriptome assembly [25]. Briefly, Trinity (this is in Trinity software) first combines reads of a certain length of overlap to form longer fragments, which are called contigs. Then the reads are mapped back to contigs. Trinity connects the contigs and identifies sequences that cannot be extended on either end. Such sequences are defined as unigenes. Unigenes from each sample's assembly can be used in further processes of sequence splicing and redundancy removal with sequence clustering software TGICL [26] in order to acquire non-redundant unigenes that are as long as possible.

Bioinformatics analysis
Unigene sequences were first aligned to the NCBI nonredundant proteins (NR), non-redundant nucleotide (NT), Swiss-Prot, Cluster of Orthologous Groups (COG) and Kyoto Encyclopedia of Genes and Genomes (KEGG) databases by BLASTx, using an e-value <0.00001. The unigenes were tentatively annotated according to the known sequences with the highest sequence similarity. The annotated unigenes direction and predicted protein coding sequences (CDS) were identified by the best alignment results. ESTScan [27] was used to predict the CDS and the sequence direction when unigenes were unaligned to any of the databases. With NR annotation, the Blast2GO program was used to classify unigenes to Gene Ontology (GO) terms such as molecular function, biological processes, and cellular components [28]. After obtaining GO annotations for all unigenes, WEGO software [29] was used to perform GO function classification for all unigenes and to analyse the distribution of S. turkestanicum gene functions at the macro-level. Simple sequence repeats (SSRs) in the nucleotide sequences were identified using the MIcroSAtellite (MISA) identification tool [30]. The poly-A and poly-T sequences at the terminal regions of the UTs were removed before SSR identification. SOAPsnp [31] was used (with parameters -u t -Q i -L 90) on pileup files to output lists of single nucleotide polymorphisms (SNPs) and their locations.

Identification of female and male differentially expressed genes
The calculation of Unigene expression used the FPKM method (Fragments Per kb per Million reads) [32]. RPKM was directly used to compare the difference of gene expression levels between females and males. False Discovery Rate (FDR) control is a statistical method used in multiple hypothesis testing to correct for p-value. In our analysis, we chose those with FDR ≤ 1 × 10 −3 and |log 2 ratio| ≥2 to identify sex-biased genes. Differentially expressed genes were then subjected to GO functional analysis and KEGG Pathway analysis.

Sequencing and assembly
Using Illumina (paired-end) sequencing technology, we obtained a total of 86,697,820 and 84,578,440 raw reads from female and male adults of S. turkestanicum, respectively. After quality filtration, 81,086,908 (total clean nucleotides: 7,297,821,720 nt) and 80,469,244 (total clean nucleotides: 7,242,231,960 nt) high quality clean reads were generated for assembly from females and males, respectively. All clean reads were submitted to the Sequence Read Archive database at NCBI (accession nos: SRP068812 and SRP068813). The Q20 percentage (Q20 percentage is the proportion of sequencing nucleotides error rate with quality value <1 %) and GC percentage of females were 97.69 % and 40.62 %, respectively. The Q20 and GC percentages of males were 97.86 % % and 41.26 %, respectively. Based on the clean reads, a total of 46,041 (>200 nt) and 69,543 (>200 nt) contiguous sequences (contigs) without gaps were produced by Trinity software for S. turkestanicum females and males, respectively (Table 1). Short unigenes (less than 200 nt) were removed. A total of 34,624 unigenes (total length: 30,412,123 nt) were produced from all samples (females and males), with an average length of 878 nt and N50 (A measure of the contig or scaffold length. It is the maximum length L such that 50 % of all nucleotides lie in contigs or scaffolds of size at least L) of 1480 nt. A total of 29,526 and 41,346 unigenes were produced by the clustering of S. turkestanicum sequences from females and males, respectively (Table 1). In females, 5356 unigenes (18.1 %) were longer than 500-1000 nt, and 3160 unigenes (10.7 %) were longer than 1000-1500 nt. In males, 6458 unigenes (15.6 %) were longer than 500-1,000 nt, and 3356 unigenes (8.1 %) were longer than 1000-1500 nt. All unigenes showed no gaps, thus demonstrating the high quality of Trinity assembly. The length distribution of these unigenes is shown in Fig. 1.

Functional annotation
In order to obtain and validate sequence-based annotations for all assembled unigenes, we employed Blastx for a sequence similarity search against the NR, NT, Swiss-Prot, GO, COG and KEGG databases, with an e-value <0.00001. Of these unigenes, 25 A total of 3,764 unigenes were assigned 3,652 GO term annotations, which could be classified into three categories: biological process, molecular function, and cellular component. The biological process category consisted of 2,328 GO terms, which were assigned to 2,728 (7.9 %) unigenes. The cellular component category consisted of 495 GO terms, which were assigned to 2,108 (6.1 %) unigenes, and the molecular function category consisted of 829 GO terms, which were assigned to 2,965 (8.6 %) unigenes (Fig. 2). Within the biological process category, most unigenes (2,296) were assigned to "cellular processes". In the cellular component category, most unigenes (1,844) were assigned to "cell". In the molecular function category, the majority of unigenes (1,775) were associated with "binding" (Fig. 2).

Discussion
Schistosomiasis caused by Schistosoma spp. remains a disease of considerable economic and medical importance in developing tropical and subtropical countries [33]. However, as a causative agent of human cercarial dermatitis and animal schistosomiasis, little is known of the transcriptome and genome of S. turkestanicum. Hence, this study decoded the trancriptomes of male and female S. turkestanicum by transcriptome sequencing.
Transcriptome sequencing is one of the most important tools for gene discovery. However, the traditional study method (large-scale EST sequencing) is time consuming and expensive. The NGS technology has become a powerful approach for high-throughput gene discovery on a genome-wide scale in organisms [9]. Due to its long read length, fast and accurate data, Solexa/Illumina have been the most widely used platforms for de novo transcriptome sequencing of many organisms, including some parasites [16,[34][35][36][37]. Recently, some studies have indicated that the relatively short reads can be effectively assembled with the great advantage of paired-end sequencing [38,39], and the Illumina transcriptome de novo sequencing and assembly have been successfully used for distinct species [40][41][42][43][44][45]. Therefore, in the present study, a nextgeneration sequencing platform and powerful de novo short-read assembly method was employed to uncover   a global view of the transcriptomes of female and male S. turkestanicum worms, and our results have also indicated that relatively short reads from Illumina pairedend sequencing can be effectively assembled.
In the present study, a great number of unigenes could match unique known proteins in public databases, which is similar with other trematodes [14][15][16][17][18][19][20]. A large number of unigenes were assigned to a wide range of GO categories and COG classifications, indicating that our data represented a wide diversity of transcripts. Based on the KEGG pathway, the top five KEGG pathways were metabolic pathways, pathways in cancer, regulation of actin cytoskeleton, focal adhesion and spliceosome pathway. These results indicated the active metabolic processes in S. turkestanicum development. Therefore, these results also strongly suggested that most of the genes involved in the different metabolic processes were identified through high-throughput Illumina transcriptome sequencing. Interestingly, the 'Immune system or Immune diseases' sub-category, including 22 pathways, was associated with 1,802 (1,802/11,816; 15.3 %) KEGG annotated unigenes, indicating that a much higher proportion of putative KO proteins of S. turkestanicum is involved in pathways associated with the immune system or immune diseases.
In the present study, a total of 13,603 distinct genes were found differentially expressed between S. turkestanicum females and males, which is significantly higher than that of S. japonicum [46], S. mansoni [18] and S. haematobium [19]. Interestingly, several pathways associated with gonadal development and sex maintenance were found, such as Wnt signaling pathway (103; 2 %). The Wnt signaling pathway is a conserved signaling network, which takes part in embryonic development, cell differentiation and proliferation, and the process of growth regulation. Wnt is a secreted glycoprotein, and many different Wnt gene subtypes have been found in a variety of animals. Wnt4 is regarded as a sex determination gene, which plays a key role in the morphological development of female mammals [47], and can regulate the formation of the Müllerian duct and the generation of ovarian steroids [48]. At birth, sexual development in males with a mutation in Wnt4 appears to be normal; however, Wnt4mutant females are masculinized and the Müllerian duct is absent while the Wolffian duct continues to develop. Wnt4 may also be required for the maintenance of the female germ line [49]. Many genes in the Wnt signaling pathway were found in our transcriptomic database, such as casein kinase 1, epsilon, E1A/CREB-binding protein, serine/threonine-protein phosphatase 2A regulatory subunit A, WNT inhibitory factor 1 and F-box and WD-40 domain protein 1/11. Further research is required to elucidate the roles of these genes in the reproductive process of S. turkestanicum.

Conclusions
The present study has produced the transcriptome data for adult female and male S. turkestanicum worms, and significantly enlarges the currently known gene repertoire of S. turkestanicum. These data will serve as a unique resource for future studies of S. turkestanicum gene functions, and will aid the ongoing whole genome sequencing efforts on this blood fluke. The characterization of these transcriptomic data has implications for the better understanding of the biology of S. turkestanicum, and will facilitate the development of intervention agents for this and other pathogenic flukes of human and animal health significance.