Vector sequence contamination of the Plasmodium vivax sequence database in PlasmoDB and In silico correction of 26 parasite sequences
© Tao et al. 2015
Received: 22 April 2015
Accepted: 2 June 2015
Published: 12 June 2015
We found a 47 aa protein sequence that occurs 17 times in the Plasmodium vivax nucleotide database published on PlasmoDB. Coding sequence analysis showed multiple restriction enzyme sites within the 141 bp nucleotide sequence and a His6 tag attached to the 3’ end, suggesting a cloning vector origin. Sequences with vector contamination were submitted to NCBI, and BLASTN was used to cross-examine whole-genome shotgun (WGS) contigs from four recently deposited P. vivax whole genome sequencing projects. At least 26 genes listed in the PlasmoDB database incorporate this cloning vector sequence into their predicted provisional protein products.
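The kind of coding sequence check described above can be illustrated with a minimal Python sketch. It is not the authors' procedure: it assumes a hand-picked panel of common restriction enzyme recognition sites and treats six consecutive histidine codons (CAC or CAT) as a His6 tag. The function names and the toy sequence are hypothetical, and the toy sequence is not the 141 bp sequence reported here.

```python
import re

# Recognition sequences of a few enzymes commonly found in cloning sites.
# The panel is an illustrative assumption, not the set used in the study.
RESTRICTION_SITES = {
    "BamHI": "GGATCC",
    "EcoRI": "GAATTC",
    "HindIII": "AAGCTT",
    "NotI": "GCGGCCGC",
    "SalI": "GTCGAC",
    "XhoI": "CTCGAG",
}

# Six consecutive histidine codons (CAC or CAT).
HIS6_PATTERN = re.compile(r"(?:CA[CT]){6}")


def find_sites(seq):
    """Map enzyme name -> list of 0-based start positions of its site."""
    seq = seq.upper()
    hits = {}
    for name, site in RESTRICTION_SITES.items():
        # A lookahead lets overlapping occurrences be reported as well.
        positions = [m.start() for m in re.finditer(f"(?={site})", seq)]
        if positions:
            hits[name] = positions
    return hits


def find_his6(seq):
    """Return 0-based start positions of runs of six histidine codons."""
    return [m.start() for m in HIS6_PATTERN.finditer(seq.upper())]


# Hypothetical toy sequence: start codon, four restriction sites,
# a His6 coding run, and a stop codon.
toy = "ATG" + "GGATCC" + "GAATTC" + "AAGCTT" + "CTCGAG" + "CAC" * 6 + "TGA"
print(find_sites(toy))
print(find_his6(toy))
```

A dense cluster of such sites followed by a His6 coding run immediately before a stop codon is the pattern that would suggest an expression vector origin rather than parasite sequence.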
Genome databases are of great value for biomedical research, and have significantly advanced our understanding of the biology of multiple parasite species, including Plasmodium falciparum and Plasmodium vivax, the two most common malaria parasites [1, 2]. The latter genome sequence was produced by shotgun sequencing by Carlton et al. at TIGR in 2008 at fivefold coverage, and is deposited at GenBank and PlasmoDB. Assembly errors are inevitable when constructing genomes and, in the case of intracellular parasites, contamination with host DNA sequence also poses a problem. Indeed, recent research has shown that many published genomes, including mammalian genomes, contain contaminating sequence from a variety of microorganisms. With regard to gene prediction errors in malaria parasites specifically, Lu et al. reported that about 20 % of genes are incorrectly predicted in the P. falciparum genome database, although these are mostly due to errors arising from the gene prediction software used.
Table. Correction of 26 genes affected by a contaminated cloning vector sequence in PlasmoDB (columns include GenBank accession number)
Generally, cloning vector source sequences are relatively easily recognized by a variety of tools, such as VecScreen. The P. vivax database has been updated more than ten times, and yet this vector sequence contamination persists, suggesting that it may have special characteristics that render it difficult to identify automatically. Attempted PCR amplification of Sal-1 genomic DNA using primers specific for the potential contaminating sequence would provide definitive proof of whether these sequences really are present in the genome, a scenario we believe to be highly unlikely.
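As a rough illustration of vector screening, the sketch below flags regions of a query sequence that share exact k-mers with a known vector sequence. It is a simplified stand-in and not how VecScreen works internally (VecScreen relies on BLAST searches against the UniVec database). The function names, the choice of k, and the toy sequences are assumptions made for the example.

```python
def kmers(seq, k):
    """Return the set of all k-mers in seq (upper-cased)."""
    seq = seq.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def vector_hits(query, vector, k=12):
    """Return merged (start, end) intervals of query, 0-based and
    end-exclusive, that are covered by k-mers also present in vector."""
    vector_kmers = kmers(vector, k)
    query = query.upper()
    intervals = []
    for i in range(len(query) - k + 1):
        if query[i:i + k] in vector_kmers:
            if intervals and i <= intervals[-1][1]:
                # Overlaps or touches the previous hit: extend it.
                intervals[-1][1] = i + k
            else:
                intervals.append([i, i + k])
    return [tuple(interval) for interval in intervals]


# Hypothetical toy data: a short "vector" and a query carrying part of it.
vector = "GGATCCGAATTCAAGCTTCTCGAG"
query = "ATGAAAACCCTTA" + vector[4:20] + "TTTTGGGAAACCC"
print(vector_hits(query, vector, k=8))
```

Exact k-mer matching misses vector segments that carry mismatches or are shorter than k, which is one way a contaminating sequence could escape a simple automated screen; an alignment-based search is more tolerant of such differences.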
The publication of whole genome sequences for four geographical reference strains now provides an opportunity to correct the Sal-1 reference genome sequence. Given our findings, it is possible that further interrogation of the P. vivax genome deposited in PlasmoDB may reveal additional contamination. It is also possible that any previous work that made use of these sequences may require reappraisal.
We thank Dr. Lu Feng from JIPD for providing valuable advice, and the peer reviewers for their insightful and constructive comments. This work was supported by grants from the National S & T Major Program (Grant No. 2012ZX10004220), the Open Programme of Key Laboratory on Technology for Parasitic Disease Prevention and Control of Chinese Ministry of Health (No. WK014-003), the Anhui Provincial Natural Science Foundation (No. 1308085MH160), the Key Program of Bengbu Medical College Science & Technology Development Fund (No. Bykf13A09) and Natural Science Fund (No. BYKY1402ZD). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
- Gardner MJ, Hall N, Fung E, White O, Berriman M, Hyman RW, et al. Genome sequence of the human malaria parasite Plasmodium falciparum. Nature. 2002;419:498–511.
- Carlton JM, Adams JH, Silva JC, Bidwell SL, Lorenzi H, Caler E, et al. Comparative genomics of the neglected human malaria parasite Plasmodium vivax. Nature. 2008;455(7214):757–63.
- Carlton J. The Plasmodium vivax genome sequencing project. Trends Parasitol. 2003;19(5):227–31.
- Merchant S, Wood DE, Salzberg SL. Unexpected cross-species contamination in genome sequencing projects. PeerJ. 2014;2:e675.
- Lu F, Jiang H, Ding J, Mu J, Valenzuela JG, Ribeiro JM, et al. cDNA sequences reveal considerable gene prediction inaccuracy in the Plasmodium falciparum genome. BMC Genomics. 2007;8:255.
- Tao ZY, Xu S, Wang YY, Fang Q, Xia H, Gao Q. Plasmodium vivax specific peptides prediction and screening based on repetitive protein sequences and linear B cell epitope. Zhongguo Xue Xi Chong Bing Fang Zhi Za Zhi. 2014;26(3):292–5, 310. [Article in Chinese].
- Neafsey DE, Galinsky K, Jiang RH, Young L, Sykes SM, Saif S, et al. The malaria parasite Plasmodium vivax exhibits greater genetic diversity than Plasmodium falciparum. Nat Genet. 2012;44(9):1046–50.
- Bahl A, Brunk B, Crabtree J, Fraunholz MJ, Gajria B, Grant GR, et al. PlasmoDB: the Plasmodium genome resource. A database integrating experimental and computational data. Nucleic Acids Res. 2003;31(1):212–5.
This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.