Vector sequence contamination of the Plasmodium vivax sequence database in PlasmoDB and In silico correction of 26 parasite sequences
Parasites & Vectors volume 8, Article number: 318 (2015)
We found a 47 aa protein sequence that occurs 17 times in the Plasmodium vivax nucleotide database published on PlasmoDB. Coding sequence analysis showed multiple restriction enzyme sites within the 141 bp nucleotide sequence and a His6 tag attached to the 3′ end, suggesting a cloning vector origin. Sequences with vector contamination were submitted to NCBI, and BLASTN was used to cross-examine whole-genome shotgun (WGS) contigs from four recently deposited P. vivax whole genome sequencing projects. At least 26 genes listed in the PlasmoDB database incorporate this cloning vector sequence into their predicted provisional protein products.
Genome databases are of great value for biomedical research, and have significantly advanced our understanding of the biology of multiple parasite species, including Plasmodium falciparum and Plasmodium vivax, the two most common malaria parasites [1, 2]. The P. vivax genome sequence was produced by shotgun sequencing at fivefold coverage by Carlton et al. at TIGR in 2008, and is deposited at GenBank and PlasmoDB. Assembly errors are inevitable when constructing genomes and, in the case of intracellular parasites, contamination with host DNA sequence also poses a problem. Indeed, recent research has shown that many published genomes, including mammalian genomes, contain contaminating sequence from a variety of microorganisms. Regarding gene prediction errors in malaria parasites specifically, Lu et al. reported that about 20 % of genes are incorrectly predicted in the P. falciparum genome database, although these errors arise mostly from the gene prediction software used.
During a search for repetitive protein fragments in the P. vivax genome, conducted on the nucleotide sequences deposited in PlasmoDB, we found a 47 amino acid (aa) sequence (KGQDNSADIQHSGGRSSLEGPRFEGKPIPNPLLGLDSTRTGHHHHHH) repeated a total of 17 times in several annotated contigs. A His6 tag (Fig 1A) was attached to the 3′ end, and multiple restriction enzyme sites (Fig 1B) were present within the 141 bp nucleotide sequence (AAG GGT CAA GAC AAT TCT GCA GAT ATC CAG CAC AGT GGC GGC CGC TCG AGT CTA GAG GGC CCG CGG TTC GAA GGT AAG CCT ATC CCT AAC CCT CTC CTC GGT CTC GAT TCT ACG CGT ACC GGT CAT CAT CAC CAT CAC CAT). When run through a VecScreen search (NCBI, http://www.ncbi.nlm.nih.gov/tools/vecscreen/), this sequence shows significant similarity to the promoter probe vector pMQ354 (Fig 1C). Together, these features suggest cloning vector sequence contamination.
We performed BLASTN searches of these 17 coding sequences against the whole-genome shotgun (WGS) contigs of four whole genome sequences (India VII [GenBank: AFMK01000000], North Korean [GenBank: AFBK01000000], Brazil I [GenBank: AFNI01000000], Mauritania I [GenBank: AFNJ01000000]). All hits were aligned with the reference sequence. The alignments showed missing or substituted base pairs at the 3′ end of the query sequences, which removed the correct stop codon of the parasite gene; as a result, the vector sequence was incorporated into the predicted parasite protein product, which then terminated at the vector stop codon. To allow for possible frame shifts, we translated the coding sequence in all three frames (Fig 1A) and used the frame two and frame three translations as query sequences against the PlasmoDB protein database. This yielded five and four sequence hits respectively, and these nine sequences were aligned and corrected as described above. In total, we discovered 26 sequences in PlasmoDB contaminated by the vector sequence (Table 1).
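The sequence-level observations above can be reproduced with a few lines of code. The following is a minimal sketch, not the authors' pipeline: it translates the 141 bp sequence with the standard genetic code, checks that the result matches the reported 47 aa peptide and ends in a His6 tag, scans for restriction sites, and produces the three forward-frame translations. The enzyme panel in `SITES` is an illustrative selection of common palindromic recognition sequences chosen for this sketch; the sites shown in Fig 1B may differ. The function names `translate` and `find_sites` are likewise hypothetical.

```python
from typing import Dict, List

# Standard genetic code, built in TCAG order ("*" marks a stop codon).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}

# The 141 bp sequence and 47 aa peptide exactly as reported in the text.
VECTOR_NT = (
    "AAG GGT CAA GAC AAT TCT GCA GAT ATC CAG CAC AGT GGC GGC CGC TCG "
    "AGT CTA GAG GGC CCG CGG TTC GAA GGT AAG CCT ATC CCT AAC CCT CTC "
    "CTC GGT CTC GAT TCT ACG CGT ACC GGT CAT CAT CAC CAT CAC CAT"
).replace(" ", "")
VECTOR_AA = "KGQDNSADIQHSGGRSSLEGPRFEGKPIPNPLLGLDSTRTGHHHHHH"

# Assumption: an illustrative enzyme panel, not the list from Fig 1B.
# All recognition sequences are palindromic, so one strand is enough.
SITES = {
    "PstI": "CTGCAG", "EcoRV": "GATATC", "NotI": "GCGGCCGC",
    "XhoI": "CTCGAG", "XbaI": "TCTAGA", "ApaI": "GGGCCC",
    "SacII": "CCGCGG", "BstBI": "TTCGAA", "MluI": "ACGCGT",
    "AgeI": "ACCGGT",
}


def translate(seq: str, frame: int = 0) -> str:
    """Translate complete codons of seq starting at offset frame (0, 1 or 2)."""
    seq = seq.upper()
    codons = (seq[i:i + 3] for i in range(frame, len(seq) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


def find_sites(seq: str) -> Dict[str, List[int]]:
    """Return 0-based start positions of each recognition sequence found in seq."""
    seq = seq.upper()
    found: Dict[str, List[int]] = {}
    for enzyme, motif in SITES.items():
        start = seq.find(motif)
        while start != -1:
            found.setdefault(enzyme, []).append(start)
            start = seq.find(motif, start + 1)
    return found


if __name__ == "__main__":
    protein = translate(VECTOR_NT)
    print("length (bp):", len(VECTOR_NT))
    print("matches reported peptide:", protein == VECTOR_AA)
    print("ends in His6:", protein.endswith("HHHHHH"))
    for enzyme, positions in sorted(find_sites(VECTOR_NT).items()):
        print(enzyme, positions)
    for frame in range(3):
        print("frame", frame + 1, translate(VECTOR_NT, frame))
```

The frame two and frame three strings printed at the end correspond to the kind of alternative-frame queries described above; in practice they would be submitted to a protein database search rather than inspected by eye.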
Sequences of cloning vector origin are generally easy to recognize with tools such as VecScreen. The P. vivax database has been updated more than ten times, and yet this vector sequence contamination persists, suggesting that it may have special characteristics that render it difficult to identify automatically. Attempted PCR amplification of Sal-I genomic DNA using primers specific for the potential contaminating sequence would provide definitive proof of whether these sequences really are present in the genome, a scenario we believe to be highly unlikely.
The publication of whole genome sequences for four geographical reference strains now provides an opportunity to correct the Sal-I reference genome sequence. Given our findings, it is possible that additional interrogation of the P. vivax genome deposited in PlasmoDB will reveal further contamination. It is also possible that previous work that made use of these sequences will require reappraisal.
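A first-pass screen for further contamination of this kind could look like the minimal sketch below. It is an assumption-laden illustration rather than the method used in this study, which relied on BLASTN and VecScreen: it flags any FASTA record that shares an exact k-mer (default k = 30, an arbitrary choice) with the 141 bp sequence on either strand. The function names and the toy record names are hypothetical, and exact k-mer matching will miss contamination that carries substitutions within every window.

```python
from typing import Iterator, List, Set, Tuple

# The 141 bp sequence exactly as reported in the text.
VECTOR_NT = (
    "AAG GGT CAA GAC AAT TCT GCA GAT ATC CAG CAC AGT GGC GGC CGC TCG "
    "AGT CTA GAG GGC CCG CGG TTC GAA GGT AAG CCT ATC CCT AAC CCT CTC "
    "CTC GGT CTC GAT TCT ACG CGT ACC GGT CAT CAT CAC CAT CAC CAT"
).replace(" ", "")

_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def parse_fasta(text: str) -> Iterator[Tuple[str, str]]:
    """Yield (record_id, sequence) pairs from FASTA-formatted text."""
    name, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            name, chunks = (line[1:].split() or [""])[0], []
        elif name is not None:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def kmers(seq: str, k: int) -> Set[str]:
    """Return the set of all k-mers in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def scan_records(fasta_text: str, k: int = 30) -> List[Tuple[str, str, int]]:
    """Report (record_id, strand, 0-based position) of the first vector k-mer per record."""
    forward = kmers(VECTOR_NT, k)
    reverse = kmers(reverse_complement(VECTOR_NT), k)
    hits = []
    for name, seq in parse_fasta(fasta_text):
        seq = seq.upper()
        for i in range(len(seq) - k + 1):
            window = seq[i:i + k]
            if window in forward:
                hits.append((name, "+", i))
                break
            if window in reverse:
                hits.append((name, "-", i))
                break
    return hits


# Toy input with hypothetical record names: one clean record and two
# records carrying the vector sequence on the plus and minus strands.
_TAILED = "ATGGCTGCTGCT" + VECTOR_NT + "TGA"
TOY_FASTA = "\n".join([
    ">toy_clean", "ATGAAACCCGGGTTTAAACCCGGGTTTAAACCCGGGTTTTAA",
    ">toy_plus", _TAILED,
    ">toy_minus", reverse_complement(_TAILED),
])

if __name__ == "__main__":
    for record_id, strand, position in scan_records(TOY_FASTA):
        print(record_id, strand, position)
```

Records flagged this way would still need alignment against independent assemblies, such as the four WGS projects used here, before any sequence is corrected.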
Gardner MJ, Hall N, Fung E, White O, Berriman M, Hyman RW, et al. Genome sequence of the human malaria parasite Plasmodium falciparum. Nature. 2002;419:498–511.
Carlton JM, Adams JH, Silva JC, Bidwell SL, Lorenzi H, Caler E, et al. Comparative genomics of the neglected human malaria parasite Plasmodium vivax. Nature. 2008;455(7214):757–63.
Carlton J. The Plasmodium vivax genome sequencing project. Trends Parasitol. 2003;19(5):227–31.
Merchant S, Wood DE, Salzberg SL. Unexpected cross-species contamination in genome sequencing projects. PeerJ. 2014;2:e675.
Lu F, Jiang H, Ding J, Mu J, Valenzuela JG, Ribeiro JM, et al. cDNA sequences reveal considerable gene prediction inaccuracy in the Plasmodium falciparum genome. BMC Genomics. 2007;8:255.
Tao ZY, Xu S, Wang YY, Fang Q, Xia H, Gao Q. Plasmodium vivax specific peptides prediction and screening based on repetitive protein sequences and linear B cell epitope. Zhongguo Xue Xi Chong Bing Fang Zhi Za Zhi. 2014;26(3):292–5, 310. [Article in Chinese].
Neafsey DE, Galinsky K, Jiang RH, Young L, Sykes SM, Saif S, et al. The malaria parasite Plasmodium vivax exhibits greater genetic diversity than Plasmodium falciparum. Nat Genet. 2012;44(9):1046–50.
Bahl A, Brunk B, Crabtree J, Fraunholz MJ, Gajria B, Grant GR, et al. PlasmoDB: the Plasmodium genome resource. A database integrating experimental and computational data. Nucleic Acids Res. 2003;31(1):212–5.
We thank Dr. Lu Feng from JIPD for providing valuable advice, and the peer reviewers for their insightful and constructive comments. This work was supported by grants from the National S & T Major Program (Grant No. 2012ZX10004220), the Open Programme of Key Laboratory on Technology for Parasitic Disease Prevention and Control of Chinese Ministry of Health (No. WK014-003), the Anhui Provincial Natural Science Foundation (No. 1308085MH160), the Key Program of Bengbu Medical College Science & Technology Development Fund (No. Bykf13A09) and Natural Science Fund (No. BYKY1402ZD). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
The authors declare that they have no competing interests.
ZYT, HX and QG conceived the study and participated in its design and coordination. ZYT, SX and QF carried out sequence comparison and correction. ZYT and RC wrote the manuscript. All authors read and approved the final manuscript.
Cite this article
Tao, ZY., Sui, X., Jun, C. et al. Vector sequence contamination of the Plasmodium vivax sequence database in PlasmoDB and In silico correction of 26 parasite sequences. Parasites Vectors 8, 318 (2015). https://doi.org/10.1186/s13071-015-0927-x