WASPS Help

Abstract

Every geneticist interested in genome evolution will have to face the analysis and comparison of strains of plasmids and viruses. For this purpose, several plasmid resources have been developed very recently, offering either a comprehensive, manually curated list of bacterial plasmids or a plasmid database that can be interrogated using sequence similarity programs. To our knowledge, bioinformatics tools providing geneticists with deeper plasmid analyses, such as extensive gene synteny, are still missing at this time. Widely used prokaryotic synteny web services are mostly geared towards cellular genomes and provide at most limited plasmid analysis capabilities. Here we present the Natural Plasmid DataBase (WASPS), which is developed in a plasmid-centric fashion and offered to researchers as a straightforward web service. The novelty of WASPS lies in pre-calculated orthologous clusters covering all plasmid-encoded proteins available in the database. The ‘fully connected’ clustering topology allows the near-instantaneous generation of completely resolved synteny maps. The relevant features of WASPS include: 1) text-based search in plasmid definitions and accession numbers; 2) protein- and DNA-based similarity searches with user-provided sequences; 3) synteny mapping using WASPS plasmids as query; 4) synteny mapping using a user-provided annotated file in GenBank format; and 5) synteny mapping using an unannotated raw or Fasta DNA file. Features 3 to 5, which involve the display of gene synteny maps, benefit from technologies provided by the latest Web standards. Extensive use of the HTML5 canvas allows for a highly responsive visualization without server recalculation. Wide genomic areas can be explored by panning and zooming directly within the client browser. Mapping data can be exported either as PNG bitmaps or as SVG vector graphics. The plasmid dataset originates from the NCBI plasmid repository and is converted into a compact database locally on our server. The database compilation is optimized and fully automated to allow frequent updates and ensure exhaustiveness of the analyses. It comprises all natural bacterial, archaeal and eukaryal plasmids.

General description. The WASPS Natural Plasmid Database is composed of three independent components: the WASPS Database, the WASPS Webtool and the WASPS Updater.

  • The WASPS Database is a free-standing entity designed to be accessible locally via server applications and scripts or remotely via specifically designed web services.

  • The WASPS Webtool offers public access to the WASPS Database. Remote users can perform a variety of queries and retrieve several levels of information relative to a particular plasmid or investigate the evolutionary relationship between different plasmids in the database using gene synteny. Similarly, user-provided annotated or unannotated plasmid sequences can be analyzed as well.

  • The WASPS Updater regenerates the whole WASPS Database at fixed intervals to ensure maximal exhaustiveness of all plasmid analyses.

Synteny

  • Etymology
    Synteny = on the same ribbon
    Greek: σύν (syn) = 'along with' and ταινία (tainiā) = 'band'

  • Origin
    The term 'synteny' was introduced by John H. Renwick at the 4th Int. Congress of Human Genetics in 1971. “Synteny (or syntenic) refers to gene loci on the same chromosome whether or not they are genetically linked by classic linkage analysis.” (Renwick, 1971)

  • Debate
    Some authors considered that the term 'synteny' was being used incorrectly. “Synteny refers [incorrectly] to gene loci in different organisms located on a chromosomal region of common evolutionary ancestry.” (Passarge, Horsthemke & Farber, Nat. Genet. Correspondence, 1999)

  • Present concept of 'shared synteny'
    The concepts of 'synteny' and 'shared synteny' are now commonly accepted and no longer controversial. Synteny constitutes the most reliable criterion for establishing the orthology of genomic regions in different species and reflects important functional relationships between genes.

  • WASPS synteny maps
    The WASPS Webtool is designed to display evolutionary relationships between natural plasmids and user-submitted DNA sequences. Each line corresponds to a single plasmid, on which arrows represent gene open reading frames (Fig. 1). Orthology between genes on the same plasmid or on different plasmids is indicated by a common color/pattern combination. Patterns have been introduced to increase the number of orthologs that can be displayed at once, overcoming the limitation of using color hues only. Genes shown in white are singletons: they do not share orthology relationships with any other gene in the map. WASPS calculates protein similarity (not DNA similarity) to assess gene orthology relationships, relying on the following programs depending on the option chosen: UCLUST, BLAST and DIAMOND. WASPS orthologous clusters are established at database creation time, and color/pattern combinations are assigned to each cluster at run time.

    Figure 1. Plasmid synteny displayed in WASPS synteny maps.
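The run-time assignment of color/pattern combinations to clusters can be sketched as follows. This is a minimal illustration and not the WASPS implementation: the palette, the pattern names and the `assign_styles` function are hypothetical, and cluster ids are assumed to be plain strings.

```python
from collections import Counter
from itertools import cycle, product

COLORS = ["red", "orange", "gold", "green", "blue", "purple"]  # hypothetical palette
PATTERNS = ["solid", "stripes", "dots", "crosshatch"]          # hypothetical patterns


def assign_styles(gene_clusters):
    """Map each cluster id to a (color, pattern) pair; singletons stay white.

    gene_clusters: list of cluster ids, one per gene shown on the map.
    """
    counts = Counter(gene_clusters)
    # Every color with the first pattern, then every color with the next pattern;
    # the sequence wraps around if clusters outnumber the combinations.
    combos = cycle(product(PATTERNS, COLORS))
    styles = {}
    for cluster in gene_clusters:  # first-seen order keeps the output stable
        if cluster in styles:
            continue
        if counts[cluster] == 1:
            styles[cluster] = ("white", "solid")  # singleton: no ortholog on this map
        else:
            pattern, color = next(combos)
            styles[cluster] = (color, pattern)
    return styles
```

Combining 6 colors with 4 patterns gives 24 distinguishable styles instead of 6, which is the motivation for patterns given above.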

Gene orthology

Orthologous genes are genes in different species that evolved from a common ancestral gene by speciation. Orthologous genes typically retain the same function in the course of evolution.

1. WASPS Database

1.1 WASPS Database structure. Annotated plasmid entries (RefSeq) are retrieved from the NCBI FTP site in binary form and processed locally on this server to generate the WASPS Database. The RefSeq plasmid records are updated on the NCBI FTP site on a regular basis, every 2 or 3 weeks. This process involves the addition of new entries, which receive a two-part NCBI identifier (accession.version). Updates of pre-existing entries keep the same accession number and increment the version number by one. These ‘accession.version’ identifiers unambiguously describe both plasmid and gene entries and are used to link the different data bins composing the WASPS database. The central part of the WASPS database consists of a single XML file containing a sequential list of plasmids exposing their relevant fields. Each plasmid contains a gene list field to store relevant genetic data (Fig. 2). Each protein in the WASPS database is therefore identified by a double ‘accession.version’ in the format ‘gene_accession.version#plasmid_accession.version’. Plasmid DNA sequences and protein sequences are stored in separate bins but are tightly linked to the central XML file through the ‘accession.version’ identifiers. All fields and DNA or protein sequences are extracted or parsed from the downloaded NCBI GenBank and DNA Fasta files. Protein orthology relationships are determined using UCLUST and injected where appropriate in the XML file. The cluster centroids calculated with UCLUST are collected separately into an additional bin. The text bins containing DNA, total protein and centroid protein sequences are then converted to binary format so that they can be efficiently queried by BlastN, BlastP, TBlastN or PsiBlast (Fig. 2).

Figure 2. Structure of the WASPS Natural Plasmid Database.
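The composite protein identifier described in section 1.1 can be handled with a few lines of code. The sketch below is illustrative only: the type and function names are hypothetical, and the identifiers used in the usage note are made-up placeholders, not real NCBI accessions.

```python
from typing import NamedTuple


class VersionedId(NamedTuple):
    accession: str
    version: int


class ProteinKey(NamedTuple):
    gene: VersionedId
    plasmid: VersionedId


def parse_versioned_id(text: str) -> VersionedId:
    """Split 'accession.version' at the last dot."""
    accession, sep, version = text.rpartition(".")
    if not sep or not accession or not version.isdigit():
        raise ValueError(f"not an accession.version identifier: {text!r}")
    return VersionedId(accession, int(version))


def parse_protein_key(text: str) -> ProteinKey:
    """Split 'gene_accession.version#plasmid_accession.version'."""
    gene, sep, plasmid = text.partition("#")
    if not sep:
        raise ValueError(f"missing '#' separator: {text!r}")
    return ProteinKey(parse_versioned_id(gene), parse_versioned_id(plasmid))


def is_newer_version(candidate: VersionedId, current: VersionedId) -> bool:
    """True when candidate is an update of the same entry (same accession, higher version)."""
    return candidate.accession == current.accession and candidate.version > current.version
```

For example, `parse_protein_key("GENE_0001.2#PLASMID_0001.1")` yields a gene part with version 2 and a plasmid part with version 1, so the same gene accession on two plasmids still produces two distinct keys.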

2. WASPS Webtool

The WASPS Webtool provides the user with different ways to query the WASPS Database.

2.1 Text-based WASPS search. This is the simplest way to interrogate the WASPS Database. A user-provided text string will be matched against four fields of the database: i) plasmid accession field; ii) plasmid definition field; iii) gene accession field and iv) gene definition field. If the query is successful, the WASPS Webtool will return the relevant ‘accession.version’ numbers followed by the corresponding definitions. Each hit also specifies whether it originates from a plasmid or a gene. ‘Accession.version’ identifiers obtained in this way can be used with other WASPS Webtool queries/options. A domain-specific text search is also possible. Text search results are connected to the plasmid synteny page via -> Synteny links to facilitate further analyses.
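A search over the four fields listed above might look like the following sketch. The record layout (a list of dictionaries with `accession`, `definition` and `genes` keys), the case-insensitive substring matching and the `text_search` function are assumptions made for illustration; the actual matching rules of the WASPS Webtool are not described here.

```python
def text_search(plasmids, query):
    """Return (origin, accession, definition) tuples for every matching plasmid or gene."""
    needle = query.lower()
    hits = []
    for plasmid in plasmids:
        # Fields i) and ii): plasmid accession and plasmid definition.
        if needle in plasmid["accession"].lower() or needle in plasmid["definition"].lower():
            hits.append(("plasmid", plasmid["accession"], plasmid["definition"]))
        # Fields iii) and iv): gene accession and gene definition.
        for gene in plasmid["genes"]:
            if needle in gene["accession"].lower() or needle in gene["definition"].lower():
                hits.append(("gene", gene["accession"], gene["definition"]))
    return hits
```

Each hit carries its origin ("plasmid" or "gene") together with the accession and definition, mirroring the result layout described in section 2.1.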

2.2 Perform local plasmid synteny. A user-provided gene or plasmid ‘accession.version’ identifier will be matched against the database. The webtool will extract the list of plasmids sharing orthologous relationships with the query gene or plasmid. Plasmids from this list can be further selected to be displayed on the Synteny Map Interface. Conservation of protein-encoding gene order, or synteny, is indicated by consistent icons and coloring. This map allows smooth pan and zoom navigation using a three-button wheel mouse. Synteny results can be exported in bitmap PNG or vector SVG formats.

2.3 Perform external plasmid synteny (raw/Fasta). A user-provided plasmid sequence in raw or Fasta format (therefore unannotated) will be scanned for open reading frames (ORFs) in the six possible reading frames. These ORFs are translated into protein sequences and matched against the centroid database. Hits obtained below the preselected threshold E-value will be assigned a cluster number and colored accordingly. The search mode for related plasmids is selected by the user from two options: i) display the primary hits only or ii) display all cluster-related hits. Plasmids from this list can be further selected to be displayed on the Synteny Map Interface. Conservation of protein-encoding gene order, or synteny, is indicated by consistent icons and coloring. The map will therefore display the predicted ORFs for the submitted sequence in the six reading frames, followed by the related plasmids from the WASPS Database. This map allows smooth pan and zoom navigation using a three-button wheel mouse. Synteny results can be exported in bitmap PNG or vector SVG formats. Translated open reading frames can be exported as protein sequences in Fasta format. Two options are available: export the complete set of translated ORFs or only the subset that was assigned a predicted function by the WASPS Webtool.
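A six-frame ORF scan of an unannotated sequence can be sketched as below. The ORF definition used here (ATG start codon to the next in-frame stop codon, standard codon assignments, a minimum length in codons, sequence treated as linear) is an assumption for illustration; the exact rules applied by the WASPS Webtool are not specified in this help page, and `find_orfs` is a hypothetical name.

```python
from itertools import product

BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# Standard codon assignments, codons enumerated in TCAG order.
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    return dna.translate(COMPLEMENT)[::-1]


def find_orfs(dna, min_codons=30):
    """Return (strand, frame, protein) for ATG-to-stop ORFs in all six frames."""
    dna = dna.upper()
    orfs = []
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for frame in range(3):
            protein, in_orf = [], False  # residues of the ORF currently being read
            for i in range(frame, len(seq) - 2, 3):
                aa = CODON_TABLE.get(seq[i:i + 3], "X")  # X for ambiguous codons
                if aa == "*":
                    if in_orf and len(protein) >= min_codons:
                        orfs.append((strand, frame + 1, "".join(protein)))
                    protein, in_orf = [], False
                elif in_orf:
                    protein.append(aa)
                elif aa == "M":
                    protein, in_orf = ["M"], True
    return orfs
```

ORFs that run off the end of the sequence without a stop codon are discarded in this sketch, so genes spanning the origin of a circular plasmid would be missed unless the sequence is extended before scanning.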

2.4 Perform external plasmid synteny (GenBank). Annotated plasmids in GenBank format can be submitted by the user for further analysis. Annotated protein sequences are extracted and matched against the centroid database. Hits obtained below the preselected threshold E-value will be assigned a cluster number and colored accordingly. The search mode for related plasmids is selected by the user from two options: i) display the primary hits only or ii) display all cluster-related hits. Plasmids from this list can be further selected to be displayed on the Synteny Map Interface. Conservation of protein-encoding gene order, or synteny, is indicated by consistent icons and coloring. The map will therefore display the query plasmid on the first line, followed by the related plasmids from the WASPS Database. This map allows smooth pan and zoom navigation using a three-button wheel mouse. Synteny results can be exported in bitmap PNG or vector SVG formats.

2.5 Blast a protein or DNA sequence against WASPS. The WASPS Database can be queried with user-provided single DNA sequences (with BlastN) or protein sequences (BlastP, TBlastN & PsiBlast). Sequences can be submitted in raw or Fasta format. The results page displays an enhanced list of hits with embedded NCBI links and the corresponding sequence alignments. BLAST search results are connected to the plasmid synteny page via -> Synteny links to facilitate further analyses.

2.6 WASPS Dashboard. The WASPS dashboard displays statistical information on the database in graphical form.

2.7 WASPS Help. This page.

2.8 WASPS links. To come.

3. WASPS Updater

The WASPS Updater regenerates the whole database at fixed intervals to ensure maximal exhaustiveness of all plasmid analyses. The WASPS update is a complex process which has been largely optimized for execution speed and low CPU resource consumption (Fig. 3). It is completely unsupervised and automated, and runs at the frequency of the NCBI RefSeq plasmid updates. Each update regenerates a new database from scratch because protein orthology calculations cannot be produced reliably using incremental database updates. Plasmid data originates from the National Center for Biotechnology Information (NCBI) and is retrieved from their public access FTP site. At this stage, only the RefSeq plasmid releases are considered in the WASPS Database because they constitute a non-redundant and well-annotated set of sequences. The XML file described above constitutes the core of the WASPS database; it is stored on disk and deserialized to server cache memory to ensure fast response across sessions/users. Alternative binary formats were tested for database disk storage, but their deserialization ranked lower than that of the XML file in benchmark tests for this particular implementation.

Figure 3. WASPS database generation pipeline.

4. Additional information

4.1 Input file formats. The WASPS Webtool accepts user-submitted plasmid or protein data in three different formats:

  • Raw format. The raw format is the simplest format which can be produced for proteins and DNA. It consists of a text string of amino acids or nucleotides represented as single characters. While carriage returns are accepted (and ignored) in this format, any other non-amino acid or non-DNA character will cause submission failure (Fig. 4).

    Figure 4. Raw sequence format.

  • Fasta format. The Fasta format contains a comment line starting with the '>' character. The actual sequence string initiates at the second line and is similar to the raw format (Fig. 5).

    Figure 5. Fasta sequence format.

  • GenBank format. The GenBank format contains a plethora of annotations, in addition to the complete sequence data at the end. GenBank follows very strict formatting rules and conventions and conveys all the available information relative to a specific sequence (Fig. 6).

    Figure 6. GenBank sequence format.
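The raw and Fasta rules described above can be illustrated with a small reader that accepts either format and rejects unexpected characters. The accepted alphabets (including `N` and `X` for unknown positions) and the `read_sequence` function are assumptions made for this sketch; the character set actually accepted by the WASPS Webtool may differ.

```python
DNA_ALPHABET = set("ACGTN")                      # assumed; N for unknown bases
PROTEIN_ALPHABET = set("ACDEFGHIKLMNPQRSTVWYX")  # assumed; X for unknown residues


def read_sequence(text, alphabet):
    """Return (header, sequence) from raw or Fasta text; header is None for raw input."""
    lines = text.strip().splitlines()
    header = None
    if lines and lines[0].startswith(">"):
        header = lines[0][1:].strip()  # Fasta comment line without the '>'
        lines = lines[1:]
    # Line breaks are ignored; every remaining character must belong to the alphabet.
    sequence = "".join(line.strip() for line in lines).upper()
    bad = sorted(set(sequence) - alphabet)
    if not sequence or bad:
        raise ValueError(f"invalid sequence; unexpected characters: {bad}")
    return header, sequence
```

A Fasta input is thus reduced to the raw case once its comment line has been removed, which matches the description of the two formats.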

4.2 UCLUST clustering algorithm. UCLUST (Edgar, 2010) is a powerful clustering algorithm used by the WASPS Database to precalculate protein orthologous clusters at database creation. UCLUST also provides a centroid protein for each orthologous cluster. The WASPS Webtool uses these centroids to quickly determine orthologous relationships at run time between user-provided sequences and the WASPS Database.
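The idea of centroid-based clustering can be illustrated with a toy greedy procedure: each sequence either joins the most similar existing centroid above a threshold or becomes a new centroid. This is only a conceptual sketch and not UCLUST itself, which relies on fast heuristic database search and alignment-based identity. The similarity measure (Python's `difflib.SequenceMatcher`), the threshold value and the `greedy_cluster` function are stand-ins chosen for illustration.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Crude stand-in for sequence identity, between 0.0 and 1.0."""
    return SequenceMatcher(None, a, b, autojunk=False).ratio()


def greedy_cluster(proteins, threshold=0.5):
    """proteins: dict of id -> sequence. Returns dict of centroid id -> member ids."""
    clusters = {}
    for pid, seq in proteins.items():  # sequences are processed in input order
        best, best_score = None, threshold
        for cid in clusters:
            score = similarity(seq, proteins[cid])
            if score >= best_score:
                best, best_score = cid, score
        if best is None:
            clusters[pid] = [pid]      # no centroid is close enough: new cluster
        else:
            clusters[best].append(pid)
    return clusters
```

Because every member is compared to centroids only, a query sequence needs to be matched against one representative per cluster rather than against every protein, which is why the centroid bin allows fast orthology assignment at run time.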

4.3 BLAST. Five similarity search options are available:

  • DIAMOND. This option, given a protein query, returns the most similar protein sequences from the WASPS protein database (using very fast DIAMOND).

  • BlastP. This option, given a protein query, returns the most similar protein sequences from the WASPS protein database (using classical NCBI BlastP).

  • BlastN. This option, given a DNA query, returns the most similar DNA sequences from the WASPS DNA database.

  • TBlastN. This option compares a protein query against all six reading frames of the WASPS DNA database.

  • Psi-Blast. This option is used to find distant relatives of a protein in the WASPS protein database.

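4.4 E-value. The E-value (expect value) is the number of hits with an equal or better score that would be expected by chance in a database of the same size. The lower the E-value, or the closer it is to zero, the more "significant" the match is.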
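As a worked illustration, the standard relation from BLAST statistics between a bit score S', the query length m and the database length n is E = m × n × 2^(−S'). The sketch below applies it directly; BLAST itself uses effective (edge-corrected) lengths, so reported values differ slightly. The function names and the default threshold are hypothetical and are not WASPS settings.

```python
def expected_hits(bit_score, query_length, database_length):
    """E = m * n * 2**(-S') for bit score S', query length m and database length n."""
    return query_length * database_length * 2.0 ** (-bit_score)


def passes_threshold(evalue, threshold=1e-5):  # threshold value is an assumption
    """True when a hit is at or below the chosen E-value threshold."""
    return evalue <= threshold
```

Each additional bit of score halves the E-value, while doubling the database size doubles it, which is why the same alignment can be significant in a small database and not in a large one.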

4.5 Search mode. Currently two synteny search modes are provided by the WASPS Webtool.

  • Primary hits only. This option retrieves only the primary plasmid hits, which correspond to the best BlastP hit for each protein against the WASPS centroid database.

  • Extended cluster hits. This option extends the primary plasmid hits by adding all related plasmids found through the WASPS protein cluster database.
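The difference between the two search modes can be sketched as follows. The data layout is assumed for illustration: `best_hits` maps each query protein to the key of its best centroid hit (or `None`), `clusters` maps a centroid key to the keys of all cluster members, and keys follow the ‘gene_accession.version#plasmid_accession.version’ format. The function names and the identifiers used in the usage note are hypothetical.

```python
def plasmid_of(key):
    """Extract the plasmid part of a 'gene#plasmid' key."""
    return key.split("#", 1)[1]


def related_plasmids(best_hits, clusters, extended=False):
    """Return plasmid identifiers in first-seen order, without duplicates."""
    plasmids = []
    for centroid in best_hits.values():
        if centroid is None:  # query protein without a significant hit
            continue
        # Primary mode keeps the centroid's plasmid only; extended mode adds
        # the plasmids of every member of the centroid's cluster.
        keys = clusters.get(centroid, [centroid]) if extended else [centroid]
        for key in keys:
            plasmid = plasmid_of(key)
            if plasmid not in plasmids:
                plasmids.append(plasmid)
    return plasmids
```

With `best_hits = {"orf1": "g1.1#pA.1"}` and a cluster containing `"g1.1#pA.1"` and `"g2.1#pC.1"`, the primary mode returns only `pA.1`, whereas the extended mode also returns `pC.1`.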

4.6 Search algorithm. To draw synteny maps using user-provided DNA sequences in raw, Fasta or GenBank formats, the WASPS Webtool will extract the corresponding protein sequences and match them against the WASPS centroid database bin. The resulting matches are used to assess the statistical significance of the sequence similarity using the E-value. All protein-protein similarity searches are performed locally on the WASPS server. Currently, two similarity search algorithms are offered by the WASPS Webtool.

  • BLAST. BLAST is one of the most widely used bioinformatics programs for sequence searching and is provided by the National Center for Biotechnology Information (NCBI). (Altschul et al, 1990)

  • DIAMOND. DIAMOND provides a fast and sensitive protein alignment algorithm (Buchfink et al, 2015). Performance-wise, DIAMOND outperforms most competing similarity search programs in terms of speed and accuracy. The implementation of DIAMOND for orthologous protein assessment in WASPS is significantly faster than BLAST in most cases, but its speed is more variable depending on the server load.

4.7 Graphical export. Synteny maps generated by the WASPS Webtool can be exported in several formats for storage or further graphical elaboration.

  • PNG. Portable Network Graphics (PNG) is a raster-graphics non-patented file format that supports lossless data compression. PNG files can be manipulated in pixel-oriented graphics programs such as Adobe Photoshop, Affinity Photo or Gimp. For the WASPS Webtool, PNG export acts as a screenshot: only the visible parts of the synteny map will be exported.

  • SVG. Scalable Vector Graphics (SVG) is an XML-based vector image format for two-dimensional graphics. The vectorial nature of this format allows substantial enlargement without resolution loss. SVG files can be manipulated in vector-oriented graphics programs such as Adobe Illustrator, Affinity Designer or Inkscape.

4.8 Protein sequence export. The synteny map produced with a user-submitted raw or Fasta plasmid sequence allows the export in Fasta format of i) all predicted translated ORFs or ii) only the predicted translated ORFs that have been assigned a function by the WASPS Webtool.
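The two export options can be sketched with a small Fasta writer. The record layout (dictionaries with `name`, `protein` and `function` keys, where `function` is `None` for ORFs without an assigned function), the header layout and the 60-column line width are assumptions made for illustration, not the format produced by the WASPS Webtool.

```python
def to_fasta(orfs, only_annotated=False, width=60):
    """Format translated ORFs as Fasta text, optionally keeping annotated ORFs only."""
    records = []
    for orf in orfs:
        if only_annotated and orf["function"] is None:
            continue  # option ii): skip ORFs without an assigned function
        header = f">{orf['name']} {orf['function'] or 'unassigned'}"
        sequence = orf["protein"]
        # Wrap the sequence into lines of at most `width` residues.
        body = [sequence[i:i + width] for i in range(0, len(sequence), width)]
        records.append("\n".join([header, *body]))
    return "".join(record + "\n" for record in records)
```

Calling `to_fasta(orfs)` corresponds to option i) and `to_fasta(orfs, only_annotated=True)` to option ii).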

4.9 Plasmid selection. Plasmids found by the WASPS Webtool are listed in the leftmost listbox. Further processing to generate synteny maps requires the presence of user-selected plasmids in the rightmost listbox. A series of buttons allows plasmids to be moved between these two listboxes. A maximum of 50 different plasmids can be visualized at once on the synteny maps, corresponding to the 50 topmost plasmids in the rightmost listbox.

4.10 Synteny Map Interface navigation. The Synteny Map Interface has been developed to allow maximal user interactivity with the genetic maps generated by the WASPS Webtool. User interactivity is achieved by means of an inexpensive 'three-button wheel mouse', standard equipment for most modern desktop computers (Fig. 7). Equivalent gestures for laptop trackpads or touch screen devices are provided by the respective operating systems.

Figure 7. Three button wheel mouse.

2D synteny maps can be smoothly panned and zoomed directly in the web browser without requiring data transfer from or to the server. This property has been achieved by exploiting the latest D3 JavaScript libraries, designed to display Data-Driven Documents.

  • Pan. The Synteny Map Interface can be panned by holding down the left mouse button.

  • Zoom. The Synteny Map Interface can be zoomed by turning the mouse wheel.

  • Hovering. Context-sensitive information is available for each displayed gene in the synteny maps. Hovering the mouse over a specific gene will present its definition in a tooltip (Fig. 8).

    Figure 8. The hovering tooltip appears in yellow color.

  • Context menu. Right clicking on a specific gene will open a context menu with four options (Fig. 9):
     i) Info: protein gene accession, plasmid accession, protein definition and protein cluster.
     ii) Single sequence: protein sequence of the highlighted gene in Fasta format.
     iii) Multiple sequences: the whole protein cluster related to the highlighted gene.
     iv) NCBI: link to the protein (GenBank format) at the NCBI.

    Figure 9. The context menu appears in white color.

5. Further developments

  • Granting REST access to the database, allowing remote researchers to build their own web apps to exploit the WASPS Database.
  • A comprehensive natural virus database built along the same model as WASPS.

6. References

  • Altschul S.F., Gish W., Miller W., Myers E.W. & Lipman D.J. (1990) Basic local alignment search tool. J. Mol. Biol. 215(3):403-10. [PubMed]

  • Buchfink B., Xie C. & Huson D.H. (2015) Fast and sensitive protein alignment using DIAMOND. Nat. Methods 12(1):59. [PubMed]

  • Edgar R.C. (2010) Search and clustering orders of magnitude faster than BLAST. Bioinformatics 26(19):2460. [PubMed]

  • Passarge E., Horsthemke B. & Farber R.A. (1999) Incorrect use of the term synteny. Nat Genet. 23(4):387. [PubMed]

  • Renwick J.H. (1971) The mapping of human chromosomes. Annu Rev Genet. 5:81-120. [PubMed]