Sybil: Web-based software for comparative genomics

 
Home Documentation Contact

Documentation index

Technical overview

The Sybil software package provides a primarily web-based front-end to comparative genome datasets warehoused in a chado relational database. It was developed by the bioinformatics department at The Institute for Genomic Research (TIGR) and development continues at the J. Craig Venter Institute (JCVI) and the Institute for Genome Sciences (IGS) at the University of Maryland: Baltimore. Sybil has been used at TIGR/JCVI, IGS, NYU, New York Medical College, Novartis Vaccines and University of Maryland: College Park to support a number of research projects that involve comparative genome analysis. The following sections provide some high-level technical details about the overall architecture and external dependencies of the Sybil package.

Sybil system architecture

Sybil is implemented primarily in Perl and makes use of the same 3-tiered software framework ("Coati") as Manatee, a related TIGR project. In Coati a set of common Perl modules handle routine tasks such as logging, URL management, low-level database access, and query result caching. Its 3-tiered architecture decouples the presentation layer (the "top" tier, consisting of a set of Perl CGI scripts) from the database access layer (the "bottom" tier, parts of which may depend on the database schema and/or vendor.) This is done by way of a schema-independent middle tier that presents a uniform API to scripts in the presentation layer. Sybil currently supports only a single database schema--the chado schema--but has support for both Sybase and PostgreSQL via Perl DBI and one or more optional database-dependent modules in the database access layer. On system startup a factory object is used to create an instance of the correct database and vendor-specific Perl module (which may be a trivial subclass of a shared, vendor-independent module in the case that none of the SQL queries need to be customized for that particular DBMS.)

The Perl CGI scripts in the presentation layer make use of HTML::Template and a set of template files to handle result formatting (e.g., as HTML or tab-delimited plain text) wherever possible. This prevents the Perl CGI scripts from becoming cluttered with hard-coded HTML tags and allows the system to support new display formats simply by adding the appropriate set of template files. Similarly, URLs (including those of the Perl CGI scripts themselves) are stored in a separate set of XML files and are referenced only by logical name. This ensures that each distinct URL appears only once in a centralized location, greatly simplifying maintenance.

Additional modules specific to Sybil handle generating most of the graphical displays supported by the system. All of the graphical display scripts can generate output as either vector (SVG) or bitmapped (e.g., PNG, JPEG) graphics. In some cases SVG support is provided by GD::SVG and in others it is implemented directly as an alternate rendering method to GD. Some displays can be configured to generate higher-quality bitmapped images by rendering first to SVG and then using the batik-rasterizer tool from the Apache Batik package to produce a PNG or JPEG image. PDF format figures for publication are also generated by converting SVG files to PDF with batik-rasterizer.

Open source tools leveraged by Sybil

Both Sybil and the other software packages on which it depends (i.e., those used to populate and maintain the comparative database) leverage the contributions of several groups that have developed bioinformatics tools and databases under the open source model. Here are some of the open source tools and databases employed by Sybil and the other software components that we use to generate, store, and maintain the genome-scale comparative datasets displayed by the system:

Preparing a comparative database for use in Sybil

Before Sybil can be used to view comparative genome data, a chado database containing the comparative data must be prepared. Here is a rough outline of the steps required to set up a new comparative analysis resource that can be interrogated via Sybil:

Software components involved in data generation and management

Finally, here are some of the software components that are used in the procedure described above. All of the following components were developed at TIGR:

Computes/algorithms

What follows is a partial list of the analyses supported by the current system. Along with each analysis we list some of the key parameters that can be used to tailor it to a specific application, and also the steps and/or analyses that must precede it. Many of these analyses are encoded in workflows that can be run on a distributed compute cluster via the TIGR workflow engine, as discussed above. Some, however, can be invoked on-demand as part of a complex visualization or web display (e.g., the syntenic blocks analysis.)

All-vs-all BLASTP analysis

A simple all-vs-all BLAST matrix that runs WashU-BLASTP on every pair of proteins in the database.

Parameters: Depends on:

Jaccard-clustering analysis

A protein clustering analysis that is typically used to group together highly similar proteins within a single genome/organism of interest. In the sample database, for example, this analysis has been run three times, once for each of the organisms (P. falciparum, P. yoelii, and C. parvum). The Jaccard clustering algorithm relies on two parameters:
  1. a Jaccard coefficient threshold between 0.0 and 1.0
  2. a minimum BLASTP percent identity threshold between 0% and 100%
Typically we have used a Jaccard coefficient threshold of 0.6 and a minimum BLASTP percent identity score of 80%; these values were chosen based on some small empirical studies. Given these two parameters we first make a graph in which the nodes of the graph are the proteins in the genome undergoing Jaccard clustering and the edges of the graph are all of the all-vs-all BLASTP matches with 80% identity or better. For any two proteins (nodes) A and B in the graph we define the Jaccard similarity coefficient as the number of nodes (other proteins) that are connected directly to both A AND B, divided by the number of nodes that are connected directly to either A OR B. A new graph is constructed with the same set of nodes, but where an edge is drawn between two proteins if and only if the Jaccard similarity coefficient for those two proteins is greater than or equal to the chosen threshold (i.e., 0.6). Intuitively, the Jaccard similarity coefficient is essentially measuring how similarly-connected A and B are. An obvious extension to the algorithm would be to take into account the percentage of the protein sequences covered by the BLASTP matches, in addition to just looking at the percent identity. There are situations, however, in which it is advantageous to allow proteins of significantly different lengths to be clustered together (e.g., when looking for gene annotation problems in an unfinished genome based on comparisons to one or more finished and manually-curated genomes.)

Parameters: Depends on:

Sybil COG analysis

In contrast with the Jaccard clustering analysis, this protein clustering algorithm is typically used to cluster proteins from distinct genomes/organisms in order to identify orthologous genes. In practice the algorithm is a simple bidirectional best BLASTP hit analysis; it groups together any pair of proteins for which each is the other's best BLASTP hit (in their respective genomes.) Unfortunately it is easily confuted by situations in which one or more of the genomes contain families of closely-related paralogs. The algorithm will have difficulty resolving the differences between the paralogs, meaning that some of them may not be clustered, and those that are clustered successfully will not necessarily all end up in the same cluster.

Parameters: Depends on:

Jaccard-filtered COG analysis

Run a bidirectional best BLASTP hit clustering algorithm (the Sybil COG analysis) on all of the genomes, taking into account the Jaccard clusters. The algorithm only considers pairs of proteins from different genomes, so let's say that the algorithm is looking at protein A1 from genome A and protein B1 from genome B. Let's also say that A1 belongs to Jaccard cluster JA1 and B1 belongs to Jaccard cluster JB1. Then A and B will be placed into the same Jaccard-filtered COG if and only if the following two conditions are met:
  1. The best outgoing BLASTP edge from JA1 "lands in" (i.e. hits a protein in) JB1.
  2. The best outgoing BLASTP edge from JB1 "lands in" (i.e. hits a protein in) JA1.

A transitive closure is done using these rules so that, for example, one can end up with a Jaccard-filtered COG with representatives (Jaccard clusters) from all of the genomes, or even multiple Jaccard clusters from the same genome. Note that the phrase "best outgoing BLASTP edge" refers to the highest-scoring BLASTP hit whose query sequence resides in Jaccard cluster JA1 and whose subject/target sequence resides in Jaccard cluster JB2. We are currently using the BLASTP bit score to determine which hit is "best", and ties are broken in an arbitrary (though hopefully deterministic fashion e.g., the first hit to appear in the BLASTP output is taken.) In most common situations such boundary conditions will not be applicable, since the Jaccard clustering phase should collect similar sequences into the same Jaccard cluster, meaning that exactly which of them has the "best" BLASTP match will be irrelevant.

Parameters: Depends on:

PROmer/NUCmer analysis

PROmer and NUCmer are two of the tools available in the MUMmer suite of sequence alignment tools. Both PROmer and NUCmer perform fast pairwise local sequence alignments using a suffix tree data structure. Both also use genomic sequences as input, but PROmer generates alignments based on conceptual 6-frame protein translations of the input sequences and NUCmer generates alignments using the untranslated DNA sequences.

Parameters: (see the PROmer and NUCmer documentation for detailed descriptions of these) Depends on:

Position effect analysis

Position effect is an algorithm that uses pairwise protein matches to identify distinct sequence segments in which the gene order and orientation are conserved. Conservation of gene order is identified by calculating an optimal alignment between ordered sets of genes within the sequence segments. The ordered sets of genes are obtained by first identifying putative homologs between proteins contained within two sequence segments, usually using the results of BLASTP. The homologous proteins of each match pair are defined as anchor points in a window of adjacent proteins upstream and downstream from the anchor point. The window size is a configurable parameters and genes within each window define an ordered set. Pairwise protein matches between the ordered sets, from a matching algorithm such as BLASTP, are used to generate an optimal alignment between the ordered sets. The optimal alignment uses a scoring function that relies on scores for matches between pairs of proteins in the ordered sets. This pairwise match score is usually the percent similarity score from BLASTP but it is configurable to use other scores. This algorithm has been utilized in the following publications:
Carlton, JM, Angiuoli, SV, Suh, BB, Kooij, TW, Pertea, M, Silva, JC, Ermolaeva, MD, Allen, JE, Selengut, JD, Koo, HL, et al.:et all. Genome sequence and comparative analysis of the model rodent malaria parasite Plasmodium yoelii yoelii. Nature 2002, 419:512-519.

Parameters: Depends on:

Syntenic block analysis

This analysis identifies regions of conserved synteny (syntenic blocks) using Jaccard-filtered COGs (J-fCOGs) or any other protein-clustering or protein-matching algorithm.

In this algorithm a match is defined as a pair of genes that belong to the same J-fCOG, such that one gene is located on a designated reference sequence (e.g., P. falciparum chromosome 14) and the other is located on a [different] query sequence. A "syntenic block" is a set of matches in which all the reference genes are located on the same reference sequence and all the query genes are located on the same query sequence. The "extent" (or location) of a syntenic block on one of the two sequences is defined as the smallest sequence interval that completely contains all of the genes that both: a) appear in one of the block's matches and b) are located on that sequence. For each reference sequence we consider all possible query sequences that have 2 or more matches with the reference. An initially empty list of partial syntenic blocks is created and the matches are sorted according to the positions of the reference genes, ignoring their orientation. The algorithm then iterates over the ordered matches and for each match the current list of partial syntenic blocks is searched to determine whether 0, 1, or 2 blocks are "near" the match (see below). If no blocks are near the match then the match is used to create a new partial syntenic block. If a single partial syntenic block is near the match then the match is added to that block. If two partial syntenic blocks are near the match (i.e., one on either side) then the two partial blocks are merged and the match is added to the newly-merged block. Once all the matches have been examined, the resulting syntenic blocks for the reference sequence are exactly those partial blocks that contain min_block_size_genes (see below) or more distinct genes and whose extent is at least min_block_size_kb (see below) kb, on both the reference and the target sequence.

A match is considered to be "near" a partial syntenic block if these conditions hold:

  1. the genes in the match and the partial syntenic block reside on the same pair of sequences
  2. there are <= max_gap_genes genes between the match's reference gene and the 3' end of the partial block's extent on the reference sequence
  3. there are <= max_gap_genes genes between the match's query gene and the 3' end of the partial block's extent on the query sequence

An exception is made for clusters of tandemly-repeated matching genes, which are counted as a single matching gene for the purposes of this calculation. The orientation of each partial block (on any given sequence) is determined by examining all of the matching genes added to the block thus far and counting how many were to the 5' of the previous matching gene, versus how many were to the 3' of the previous matching gene. Blocks with more 5' additions are considered to be in the reverse orientation (with respect to that sequence.)

Parameters:

Note that most of the above parameters can be set to different values for the reference and target genomes, in order to account for differences between the two, if so desired.

Depends on:

SNP-discovery analysis

The current SNP-discovery analysis is a two step process. In the first step, any sequence contigs or fragments from a partially-finished genome are tiled along a reference sequence using NUCmer from the MUMmer 3.0 software package. In the second step, MUMmer is run to identify exact sequence matches between the tiled sequence and the reference sequence. These exact matches are masked out during a post-processing step to identify a set of polymorphisms between the two sequences. Further processing of the putative polymorphisms is done to identify likely SNPs. These SNPs are then loaded into the database where they can be viewed in the context of sequence coverage and/or quality data in Sybil.

Parameters: Depends on:
questions/comments: sybil-info
last updated Oct. 5, 2012
SourceForge.net Logo