Technical overview
The Sybil software package provides a primarily web-based front-end to comparative genome datasets
warehoused in a chado relational database. It was developed by the
bioinformatics department at The Institute for Genomic Research (TIGR), and development continues at the
J. Craig Venter Institute (JCVI) and the Institute for Genome Sciences (IGS) at the University of Maryland, Baltimore.
Sybil has been used at TIGR/JCVI, IGS, NYU, New York Medical College, Novartis Vaccines, and the University
of Maryland, College Park to support a number of research projects that
involve comparative genome analysis. The following sections provide some high-level technical details about
the overall architecture and external dependencies of the Sybil package.
Sybil system architecture
Sybil is implemented primarily in Perl and makes use of the same 3-tiered software framework ("Coati") as
Manatee, a related TIGR project. In Coati a set of common Perl
modules handle routine tasks such as logging, URL management, low-level database access, and query result caching.
Its 3-tiered architecture decouples the presentation layer (the "top" tier, consisting of a set of Perl CGI scripts)
from the database access layer (the "bottom" tier, parts of which may depend on the database schema and/or vendor.)
This is done by way of a schema-independent middle tier that presents a uniform API to scripts in the
presentation layer. Sybil currently supports only a single database
schema--the chado schema--but has support
for both Sybase and PostgreSQL via Perl DBI and one or more optional database-dependent modules in the database
access layer. On system startup a factory object is used to create an instance of the correct database and
vendor-specific Perl module (which may be a trivial subclass of a shared, vendor-independent module in the
case that none of the SQL queries need to be customized for that particular DBMS.)
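As a rough illustration of this pattern (a minimal sketch only; the package names below are hypothetical and do not
correspond to the actual Coati/Sybil modules), such a factory might look like this in Perl:

    # Minimal sketch of a vendor-dispatching factory (hypothetical package
    # names; not the actual Coati/Sybil code).
    package SybilDB::Factory;
    use strict;
    use warnings;

    # Map a DBMS vendor name to the module holding its customized SQL (if any).
    my %VENDOR_MODULE = (
        Sybase   => 'SybilDB::Chado::Sybase',
        Postgres => 'SybilDB::Chado::Postgres',
    );

    sub create {
        my ($class, %args) = @_;    # e.g., vendor, database, user, password
        my $module = $VENDOR_MODULE{ $args{vendor} }
            or die "Unsupported DBMS vendor: $args{vendor}";
        eval "require $module" or die $@;    # load the vendor-specific subclass
        return $module->new(%args);          # exposes the schema-independent API
    }

    1;

Scripts in the upper tiers would then call such a factory once at startup and work only with the returned
object's vendor-independent methods.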
The Perl CGI scripts in the presentation layer make use of HTML::Template
and a set of template files to handle result formatting (e.g., as HTML or tab-delimited plain text) wherever possible. This prevents
the Perl CGI scripts from becoming cluttered with hard-coded HTML tags and allows the system to support
new display formats simply by adding the appropriate set of template files. Similarly, URLs (including
those of the Perl CGI scripts themselves) are stored in a separate set of XML files and are referenced
only by logical name. This ensures that each distinct URL appears only once in a centralized location,
greatly simplifying maintenance.
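To illustrate (a minimal sketch only; the template file name and variable names are assumptions, not taken from
Sybil), a presentation-layer CGI script using HTML::Template typically looks something like this:

    #!/usr/bin/perl
    # Minimal HTML::Template sketch (hypothetical template and variable names).
    use strict;
    use warnings;
    use HTML::Template;

    # Pick an HTML or tab-delimited template depending on the requested format.
    my $format   = 'html';    # or 'tab'
    my $template = HTML::Template->new(filename => "gene_list.$format.tmpl");

    # The script supplies only data; all markup lives in the template file.
    $template->param(
        page_title => 'Gene list',
        genes      => [ { name => 'PF14_0001', length => 1234 } ],
    );
    print "Content-type: text/html\n\n", $template->output;

Supporting a new output format then amounts to writing a new set of template files rather than modifying the
Perl code.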
Additional modules specific to Sybil handle generating most of the
graphical displays supported by the system. All of the graphical
display scripts can generate output as either vector (SVG) or bitmapped (e.g., PNG, JPEG) graphics. In
some cases SVG support is provided by GD::SVG
and in others it is implemented directly as an alternate
rendering method to GD. Some displays can be configured to generate higher-quality bitmapped images by
rendering first to SVG and then using the batik-rasterizer tool from the Apache Batik
package to produce a PNG or JPEG image. PDF format figures for publication are also generated by converting
SVG files to PDF with batik-rasterizer.
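As an illustration of the dual rendering approach (a sketch only, assuming the GD and GD::SVG modules are
installed; this is not code from the Sybil display modules), the same drawing routine can be pointed at either
backend:

    # Sketch of dual GD / GD::SVG rendering (illustrative only).
    use strict;
    use warnings;
    use GD;
    use GD::SVG;

    sub draw_gene_box {
        my ($class) = @_;    # 'GD::Image' or 'GD::SVG::Image'
        my $img   = $class->new(400, 50);
        my $white = $img->colorAllocate(255, 255, 255);   # first color = background
        my $blue  = $img->colorAllocate(0, 0, 200);
        $img->filledRectangle(20, 15, 380, 35, $blue);    # a single "gene" glyph
        return $img;
    }

    # Bitmapped (PNG) output via GD...
    open my $png_fh, '>', 'gene.png' or die $!;
    binmode $png_fh;
    print {$png_fh} draw_gene_box('GD::Image')->png;

    # ...or vector (SVG) output via the drop-in GD::SVG interface.
    open my $svg_fh, '>', 'gene.svg' or die $!;
    print {$svg_fh} draw_gene_box('GD::SVG::Image')->svg;

The SVG file can then be handed to batik-rasterizer (with an invocation along the lines of
java -jar batik-rasterizer.jar -m image/png gene.svg, or -m application/pdf for publication figures) to produce
the higher-quality output mentioned above.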
Open source tools leveraged by Sybil
Both Sybil and the other software packages on which it depends (i.e., those used to populate and maintain
the comparative database) leverage the contributions of several groups that have developed
bioinformatics tools and databases under the open source model. Here are some of the open source tools
and databases employed by Sybil and the other software components that we use to generate, store, and
maintain the genome-scale comparative datasets displayed by the system:
- The chado relational database schema, the reference schema for GMOD
- BSML (Bioinformatic Sequence Markup Language)
- The Perl programming language
- Bioperl (particularly Lincoln Stein's Bio::Graphics package)
- The Apache Batik SVG toolkit
- The MUMmer suffix-tree-based suite of alignment tools
Preparing a comparative database for use in Sybil
Before Sybil can be used to view comparative genome data, a chado database containing the comparative data
must be prepared. Here is a rough outline of the steps required to set up a new comparative analysis
resource that can be interrogated via Sybil:
- Create an initially empty (with the exception of some support tables) chado comparative database.
- Reserve space on disk for a "BSML repository" in which all BSML files related to the project will be stored.
- Convert the (typically single-organism) input datasets into BSML and place these files in the repository.
- Load the BSML-formatted data into the database using a DBMS-specific bulk loader (e.g., bcp for Sybase.)
- Run a series of analyses/computes on the data (a schematic sketch of this cycle follows the list). These computes:
- Typically both read from and write to the BSML repository.
- May be run on the entire database (e.g., all-vs-all blastp) or any subset thereof (e.g., single organism protein clustering.)
- Are usually encoded by workflows that are run by the TIGR workflow engine (see below.)
- May depend on the results of other computes (e.g., protein clustering depends on all-vs-all blastp)
- Load the BSML-formatted results of the computes into the database via the bulk loader API.
- Configure the Sybil web-based interface to connect to the appropriate chado database.
- Repeat compute/load steps as needed (e.g., to adjust parameter settings or explore new comparisons.)
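The following is a schematic sketch of the convert/load/compute cycle outlined above; every script name in it is
a hypothetical placeholder for the actual TIGR/IGS tooling, and the paths, database name, and workflow names are
invented for illustration:

    # Schematic driver for the convert/load/compute cycle.  All script names,
    # paths, and workflow names below are hypothetical placeholders.
    use strict;
    use warnings;

    my $repository = '/usr/local/bsml_repository/my_project';

    # 1. Convert each (typically single-organism) input dataset to BSML.
    for my $genome (qw(p_falciparum p_yoelii c_parvum)) {
        system("genbank2bsml.pl --input $genome.gbk --output $repository/$genome.bsml") == 0
            or die "BSML conversion failed for $genome";
    }

    # 2. Bulk-load the BSML-formatted data into the chado database.
    system("bsml2chado.pl --repository $repository --database my_chado_db") == 0
        or die "bulk load failed";

    # 3. Run computes (e.g., all-vs-all blastp, then clustering) as workflows,
    #    then load their BSML results with the same bulk-load step.
    for my $workflow (qw(all_vs_all_blastp.xml jaccard_clustering.xml)) {
        system("run_workflow.pl --workflow $workflow --repository $repository") == 0
            or die "workflow $workflow failed";
    }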
Software components involved in data generation and management
Finally, here are some of the software components that are used in the procedure described above. All of the
following components were developed at TIGR:
- A set of Perl modules for reading, writing, and validating BSML-formatted XML data.
- Scripts for converting "legacy" data into BSML files suitable for import into chado.
- Scripts for converting BSML files into the format understood by the bulk loader utility.
- The TIGR workflow engine, a generic Java-based workflow processing system that:
- Simplifies the execution of complex multi-step procedures with embedded dependencies.
- Automatically handles distributing CPU-intensive tasks onto a multi-node compute grid/cluster.
- Allows halted workflows to be resumed from the point at which failure occurred.
The TIGR workflow engine is also being packaged for public release; see the relevant SourceForge page.
- A set of XML-encoded workflows that handle many of the steps mentioned in the list above.
- Tools for creating new workflows (by chaining together existing subflows) and monitoring their execution progress.
Computes/algorithms
What follows is a partial list of the analyses supported by the current system. Along with each analysis
we list some of the key parameters that can be used to tailor it to a specific application, and also the
steps and/or analyses that must precede it. Many of these analyses are encoded in workflows that can be
run on a distributed compute cluster via the TIGR workflow engine, as discussed above. Some, however, can
be invoked on-demand as part of a complex visualization or web display (e.g., the syntenic blocks analysis.)
All-vs-all BLASTP analysis
A simple all-vs-all BLAST matrix that runs WashU-BLASTP on every pair of proteins in the database (a rough
sketch of the corresponding command line follows the lists below).
Parameters:
- expect - only load GSPs/subjects with a smaller E-value than this
- filter - optionally perform low-complexity filtering (e.g., seg, xnu) on the query sequence
- gspmax - limit number of GSPs/HSPs per subject sequence
- matrix - set scoring matrix (e.g., BLOSUM62)
Depends on:
- Initial population of BSML repository with input genomes/proteomes.
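For orientation, here is a sketch of how these parameters might map onto a WashU-BLASTP command line (the exact
flag spellings are an assumption and the file names are placeholders; this is not the command generated by the
Sybil workflows):

    # Rough sketch of a WashU-BLASTP invocation (flag spellings are assumptions;
    # database and query file names are placeholders).
    use strict;
    use warnings;

    my $cmd = join ' ',
        'blastp', 'all_proteins.pep', 'query.pep',    # database, then query
        'E=1e-5',                                     # expect cutoff
        'filter=seg',                                 # low-complexity filtering
        'gspmax=20',                                  # GSPs per subject sequence
        'matrix=BLOSUM62';                            # scoring matrix
    system($cmd) == 0 or die "blastp failed: $?";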
Jaccard-clustering analysis
A protein clustering analysis that is typically used to group together highly similar proteins within
a single genome/organism of interest. In the sample database, for example, this analysis has been
run three times, once for each of the organisms (P. falciparum, P. yoelii, and
C. parvum). The Jaccard clustering algorithm relies on two parameters:
- a Jaccard coefficient threshold between 0.0 and 1.0
- a minimum BLASTP percent identity threshold between 0% and 100%
Typically we have used a Jaccard coefficient threshold of 0.6 and a minimum BLASTP
percent identity score of 80%; these values were chosen based on some
small empirical studies. Given these two parameters we first make a graph in which the
nodes of the graph are the proteins in the genome undergoing Jaccard clustering
and the edges of the graph are all of the all-vs-all BLASTP matches
with 80% identity or better. For any two proteins (nodes) A and B in
the graph we define the Jaccard similarity coefficient as the
number of nodes (other proteins) that are connected directly to both A AND B,
divided by the number of nodes that are connected directly to either A OR B.
A new graph is constructed with the same set of nodes, but where an
edge is drawn between two proteins if and only if the Jaccard
similarity coefficient for those two proteins is greater than or equal to the chosen
threshold (i.e., 0.6). Intuitively, the Jaccard similarity
coefficient is essentially measuring how similarly-connected A and B are. An
obvious extension to the algorithm would be to take into account the percentage
of the protein sequences covered by the BLASTP matches, in addition to just
looking at the percent identity. There are situations, however, in which it
is advantageous to allow proteins of significantly different lengths to be
clustered together (e.g., when looking for gene annotation problems in an unfinished
genome based on comparisons to one or more finished and manually-curated genomes.)
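A minimal sketch of the coefficient calculation just described (illustrative Perl, not the Sybil implementation;
the %$neighbors structure is an assumed representation of the BLASTP graph):

    # Jaccard similarity coefficient over a BLASTP match graph (sketch only).
    # $neighbors->{$protein} is a hash-set of the proteins that $protein matches
    # at or above the percent-identity cutoff, e.g. $neighbors->{'A'}{'C'} = 1.
    use strict;
    use warnings;

    sub jaccard_coefficient {
        my ($neighbors, $a, $b) = @_;
        my %union  = (%{ $neighbors->{$a} }, %{ $neighbors->{$b} });
        my $shared = grep { $neighbors->{$b}{$_} } keys %{ $neighbors->{$a} };
        return 0 unless keys %union;
        return $shared / scalar(keys %union);
    }

    # An edge is drawn in the second graph iff the coefficient meets the
    # threshold (0.6 in the typical configuration described above).
    sub jaccard_linked {
        my ($neighbors, $a, $b, $threshold) = @_;
        return jaccard_coefficient($neighbors, $a, $b) >= $threshold;
    }

In the usual formulation the clusters are then the connected components of this second graph.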
Parameters:
- linkscore - Jaccard coefficient threshold between 0 and 1 (see above)
- pidentity_cutoff - Minimum BLASTP percent identity (see above)
- pvalcut - Maximum BLASTP p-value threshold to apply when creating the initial graph
- max_multi_alignment - Limit computation of clustalw alignments (a post-processing step) to clusters of this size or smaller
Depends on:
- The all-vs-all BLASTP analysis.
Sybil COG analysis
In contrast with the Jaccard clustering analysis, this protein clustering algorithm is
typically used to cluster proteins from distinct genomes/organisms in order to
identify orthologous genes. In practice the algorithm is a simple bidirectional best
BLASTP hit analysis; it groups together any pair of proteins for which each is the other's
best BLASTP hit (in their respective genomes.) Unfortunately it is easily confounded by
situations in which one or more of the genomes contain families of closely-related
paralogs. The algorithm will have difficulty resolving the differences between the
paralogs, meaning that some of them may not be clustered, and those that are clustered
successfully will not necessarily all end up in the same cluster.
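A minimal sketch of the bidirectional-best-hit test (illustrative Perl only; the %$best_hit structure is an
assumed representation of the BLASTP results):

    # Bidirectional best BLASTP hit test (sketch only, not the Sybil code).
    # $best_hit->{$protein}{$other_genome} holds the identifier of the
    # best-scoring hit of $protein among the proteins of $other_genome.
    use strict;
    use warnings;

    sub is_bidirectional_best_hit {
        my ($best_hit, $prot_a, $genome_a, $prot_b, $genome_b) = @_;
        return ($best_hit->{$prot_a}{$genome_b} // '') eq $prot_b
            && ($best_hit->{$prot_b}{$genome_a} // '') eq $prot_a;
    }

Any pair of proteins passing this test is placed in the same cluster.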
Parameters:
- pvalcut - Maximum BLASTP p-value to use when looking for bidirectional best BLASTP matches.
- max_multi_alignment - Limit computation of clustalw alignments (a post-processing step) to clusters of this size or smaller
Depends on:
- The all-vs-all BLASTP analysis.
Jaccard-filtered COG analysis
Run a bidirectional best BLASTP hit clustering algorithm (the Sybil COG analysis) on all of the genomes, taking into account the Jaccard clusters. The algorithm only considers pairs of proteins from different genomes, so let's say that the algorithm is looking at protein A1 from genome A and protein B1 from genome B. Let's also say that A1 belongs to Jaccard cluster JA1 and B1 belongs to Jaccard cluster JB1. Then A1 and B1 will be placed into the same Jaccard-filtered COG if and only if the following two conditions are met:
- The best outgoing BLASTP edge from JA1 "lands in" (i.e. hits a protein in) JB1.
- The best outgoing BLASTP edge from JB1 "lands in" (i.e. hits a protein in) JA1.
A transitive closure is done using these rules so that, for example, one can end up with a Jaccard-filtered COG with representatives (Jaccard clusters) from all of the genomes, or even multiple Jaccard clusters from the same genome. Note that the phrase "best outgoing BLASTP edge" refers to the highest-scoring BLASTP hit whose query sequence resides in Jaccard cluster JA1 and whose subject/target sequence resides in Jaccard cluster JB1. We are currently using the BLASTP bit score to determine which hit is "best", and ties are broken in an arbitrary (though hopefully deterministic) fashion, e.g., the first hit to appear in the BLASTP output is taken. In most common situations such boundary conditions will not be applicable, since the Jaccard clustering phase should collect similar sequences into the same Jaccard cluster, meaning that exactly which of them has the "best" BLASTP match will be irrelevant.
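A sketch of the linking rule (illustrative Perl; the %$best_edge structure is an assumption about how the best
outgoing edges might be stored):

    # Jaccard-filtered COG linking rule (sketch only, not the Sybil code).
    # $best_edge->{$cluster}{$other_genome} holds the Jaccard cluster in
    # $other_genome that receives the best outgoing BLASTP edge (by bit score).
    use strict;
    use warnings;

    sub clusters_linked {
        my ($best_edge, $ja1, $genome_a, $jb1, $genome_b) = @_;
        return ($best_edge->{$ja1}{$genome_b} // '') eq $jb1
            && ($best_edge->{$jb1}{$genome_a} // '') eq $ja1;
    }

    # Linked cluster pairs are then merged by transitive closure (e.g., with a
    # union-find pass) to produce the final Jaccard-filtered COGs.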
Parameters:
- jaccard_dir - Specifies the Jaccard clusters to use for the first step of the clustering process.
- pvalcut - Maximum BLASTP p-value to use when looking for bidirectional best BLASTP matches.
- max_multi_alignment - Limit computation of clustalw alignments (a post-processing step) to clusters of this size or smaller
Depends on:
- The all-vs-all BLASTP analysis.
- The Jaccard-clustering analysis.
PROmer/NUCmer analysis
PROmer and NUCmer are two of the tools available in the MUMmer
suite of sequence alignment tools. Both PROmer and NUCmer perform fast pairwise local sequence alignments using
a suffix tree data structure. Both also use genomic sequences as input, but PROmer generates alignments based on
conceptual 6-frame protein translations of the input sequences and NUCmer generates alignments using the
untranslated DNA sequences.
Parameters: (see the PROmer and NUCmer documentation for detailed descriptions of these; a sketch of a typical invocation follows the lists below)
- breaklen
- diagfactor
- masklen
- maxgap
- mincluster
- minmatch
Depends on:
- Initial population of BSML repository with input genomes.
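For reference, here is a sketch of how these parameters might appear on a PROmer command line driven from Perl
(the long-option spellings and the values shown are assumptions; consult the PROmer documentation for the
authoritative list):

    # Sketch of a PROmer invocation (option spellings and values are assumptions;
    # file names are placeholders).
    use strict;
    use warnings;

    my @cmd = (
        'promer',
        '--prefix',     'ref_vs_qry',
        '--breaklen',   60,
        '--diagfactor', 0.11,
        '--masklen',    8,
        '--maxgap',     30,
        '--mincluster', 20,
        '--minmatch',   6,
        'reference.fasta', 'query.fasta',
    );
    system(@cmd) == 0 or die "promer failed: $?";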
Position effect analysis
Position effect is an algorithm
that uses pairwise protein matches to identify distinct sequence
segments in which the gene order and orientation are
conserved. Conservation of gene order is identified by calculating an
optimal alignment between ordered sets of genes within the sequence
segments. The ordered sets of genes are obtained by first identifying
putative homologs between proteins contained within two sequence
segments, usually using the results of BLASTP. The homologous
proteins of each match pair serve as anchor points, and a window of
adjacent proteins upstream and downstream of each anchor point is taken.
The window size is a configurable parameter, and the genes within each
window define an ordered set. Pairwise protein matches between the
ordered sets, from a matching algorithm such as BLASTP, are used to
generate an optimal alignment between the ordered sets. The optimal
alignment uses a scoring function that relies on scores for matches
between pairs of proteins in the ordered sets. This pairwise match
score is usually the percent similarity score from BLASTP but it is
configurable to use other scores. This algorithm has been utilized in
the following publications:
Carlton JM, Angiuoli SV, Suh BB, Kooij TW, Pertea M, Silva JC, Ermolaeva MD, Allen JE, Selengut JD, Koo HL, et al.: Genome sequence and comparative analysis of the model rodent malaria parasite Plasmodium yoelii yoelii. Nature 2002, 419:512-519.
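The alignment of ordered gene sets described above can be pictured as a standard global alignment over genes
rather than residues. Below is a minimal sketch (illustrative Perl, not the Sybil implementation) in which the
$score callback is assumed to return the pairwise match score (e.g., BLASTP percent similarity) or the
orientation penalty for out-of-orientation matches:

    # Needleman-Wunsch-style alignment of two ordered gene sets (sketch only).
    use strict;
    use warnings;

    sub align_ordered_sets {
        my ($set_a, $set_b, $score, $gap_penalty) = @_;
        my @m;
        my ($n_a, $n_b) = (scalar @$set_a, scalar @$set_b);

        # Initialize the dynamic-programming matrix with gap penalties.
        $m[$_][0] = $_ * $gap_penalty for 0 .. $n_a;
        $m[0][$_] = $_ * $gap_penalty for 0 .. $n_b;

        for my $i (1 .. $n_a) {
            for my $j (1 .. $n_b) {
                my $diag = $m[$i - 1][$j - 1]
                         + $score->($set_a->[$i - 1], $set_b->[$j - 1]);
                my $up   = $m[$i - 1][$j] + $gap_penalty;
                my $left = $m[$i][$j - 1] + $gap_penalty;
                $m[$i][$j] = ($diag >= $up && $diag >= $left) ? $diag
                           : ($up >= $left)                   ? $up
                           :                                    $left;
            }
        }
        return $m[$n_a][$n_b];    # optimal alignment score between the two sets
    }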
Parameters:
- window_size - Number of adjacent genes to use when defining each ordered set. Default 10
- gap_penalty - Penalty for non-matching genes in an alignment of ordered sets. Default -30.
- gene_count_cutoff - Minimum number of genes in an ordered set. Ordered sets below this cutoff are ignored. Default 2.
- gene_length_cutoff - Minimum length of genes in bp. Default 0.
- min_matches_per_window - Minimum number of matching genes per ordered set. Ordered sets below this cutoff are ignored. Default 2.
- orientation_penalty - Penalty for matches that occur "out of orientation" with respect to the orientation of the ordered set. Default -100.
- rankfield - Score used to determine ranking of pairwise match scores when multiple pairwise scores are available. Default BLASTP p-value.
- ranktype - Not currently implemented.
- scorefield - The pairwise match score used to quantify matches between genes in an alignment of ordered sets. Default BLASTP percent similarity
Depends on:
- Any type of protein-protein matching algorithm, including all-vs-all BLASTP and the protein clustering analyses.
Syntenic block analysis
This analysis identifies regions of conserved synteny (syntenic blocks) using Jaccard-filtered COGs (J-fCOGs) or any other protein-clustering or protein-matching algorithm.
In this algorithm a match is defined as a pair of genes that belong to the same J-fCOG, such that one gene is located on a designated reference sequence (e.g., P. falciparum chromosome 14) and the other is located on a [different] query sequence. A "syntenic block" is a set of matches in which all the reference genes are located on the same reference sequence and all the query genes are located on the same query sequence. The "extent" (or location) of a syntenic block on one of the two sequences is defined as the smallest sequence interval that completely contains all of the genes that both: a) appear in one of the block's matches and b) are located on that sequence.
For each reference sequence we consider all possible query sequences that have 2 or more matches with the reference. An initially empty list of partial syntenic blocks is created and the matches are sorted according to the positions of the reference genes, ignoring their orientation. The algorithm then iterates over the ordered matches and for each match the current list of partial syntenic blocks is searched to determine whether 0, 1, or 2 blocks are "near" the match (see below). If no blocks are near the match then the match is used to create a new partial syntenic block. If a single partial syntenic block is near the match then the match is added to that block. If two partial syntenic blocks are near the match (i.e., one on either side) then the two partial blocks are merged and the match is added to the newly-merged block.
Once all the matches have been examined, the resulting syntenic blocks for the reference sequence are exactly those partial blocks that contain min_block_size_genes (see below) or more distinct genes and whose extent is at least min_block_size_kb (see below) kb, on both the reference and the target sequence.
A match is considered to be "near" a partial syntenic block if these conditions hold:
- the genes in the match and the partial syntenic block reside on the same pair of sequences
- there are <= max_gap_genes genes between the match's reference gene and the 3' end of the partial block's extent on the reference sequence
- there are <= max_gap_genes genes between the match's query gene and the 3' end of the partial block's extent on the query sequence
An exception is made for clusters of tandemly-repeated matching genes, which are counted as a single matching gene for the purposes of this calculation. The orientation of each partial block (on any given sequence) is determined by examining all of the matching genes added to the block thus far and counting how many were to the 5' of the previous matching gene, versus how many were to the 3' of the previous matching gene. Blocks with more 5' additions are considered to be in the reverse orientation (with respect to that sequence.)
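A condensed sketch of the block-building loop (illustrative Perl only; the match representation is an assumption,
the "near" test is simplified to gene-index distance from the block's most recently added match, and
tandem-repeat collapsing and the final size filters are omitted):

    # Greedy construction of partial syntenic blocks from matches that are
    # pre-sorted by reference-gene position (sketch only).
    use strict;
    use warnings;

    # Simplified "near" test: same query sequence, and within max_gap_genes of
    # the block's most recently added match on both sequences.
    sub near_block {
        my ($block, $match, $max_gap_genes) = @_;
        my $last = $block->[-1];
        return $last->{qry_seq} eq $match->{qry_seq}
            && abs($match->{ref_idx} - $last->{ref_idx}) <= $max_gap_genes + 1
            && abs($match->{qry_idx} - $last->{qry_idx}) <= $max_gap_genes + 1;
    }

    sub build_blocks {
        my ($matches, $max_gap_genes) = @_;
        my @blocks;    # each partial block is a list of matches
        for my $match (@$matches) {
            my @near = grep { near_block($_, $match, $max_gap_genes) } @blocks;
            if (@near == 0) {
                push @blocks, [$match];              # start a new partial block
            }
            elsif (@near == 1) {
                push @{ $near[0] }, $match;          # extend the nearby block
            }
            else {                                   # merge the two flanking blocks
                my ($a, $b) = @near[0, 1];
                push @$a, @$b, $match;
                @blocks = grep { $_ != $b } @blocks;
            }
        }
        return @blocks;    # caller applies min_block_size_genes / min_block_size_kb
    }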
Parameters:
- refseq - reference sequence
- target_organisms - optional list of target organisms
- target_seqs - optional list of target sequences
- min_block_size_kb - minimum syntenic block size in kilobases
- min_block_size_genes - minimum syntenic block size in number of genes
- max_gap_genes - maximum number of adjacent nonmatching genes that may appear in a block
Note that most of the above parameters can be set to different values for the reference and target
genomes, in order to account for differences between the two, if so desired.
Depends on:
- A protein-clustering or protein-matching analysis (e.g., the Jaccard-filtered COG analysis).
SNP-discovery analysis
The current SNP-discovery analysis is a two-step process. In the first
step, any sequence contigs or fragments from a partially-finished genome
are tiled along a reference sequence using NUCmer from the MUMmer 3.0
software package. In the second step, MUMmer is run to identify exact sequence
matches between the tiled sequence and the reference sequence. These
exact matches are masked out during a post-processing step to identify
a set of polymorphisms between the two sequences. Further processing
of the putative polymorphisms is done to identify likely SNPs. These
SNPs are then loaded into the database where they can be viewed in the
context of sequence coverage and/or quality data in Sybil.
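A rough sketch of how the two steps might be driven from Perl (the MUMmer invocations shown are assumptions, the
file names are placeholders, and the polymorphism post-processing is omitted; the real pipeline is
workflow-driven):

    # Two-step SNP-discovery sketch (illustrative only).
    use strict;
    use warnings;

    # Step 1: tile draft contigs along the reference with NUCmer + show-tiling.
    system('nucmer', '-p', 'tiling', 'reference.fasta', 'contigs.fasta') == 0
        or die 'nucmer failed';
    system('show-tiling tiling.delta > contigs.tiling') == 0
        or die 'show-tiling failed';

    # Step 2: find exact matches between the tiled sequence and the reference.
    # Post-processing then masks these exact matches and examines what remains
    # for candidate polymorphisms/SNPs.
    system('mummer -mum -l 20 reference.fasta tiled_contigs.fasta > exact.mums') == 0
        or die 'mummer failed';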
Parameters:
- tiling_coverage_threshold - Minimum percent of sequence coverage for tiling fragments. Default 80%.
- tiling_maximum_gap - Maximum gap between tiling fragments. Default -1.
- mummer_minimum_match_length - Minimum MUM length for MUMmer. Default 20.
- mgaps_minimum_cluster_match_length - Minimum length of cluster match for mgaps program in NUCmer package. Default 100.
- mgaps_maximum_seperation_between_clusters - Maximum separation between matches in cluster for mgaps. Default 600.
- mgaps_fractional_diagonal_distance - Fraction of separation for diagonal difference for mgaps. Default .12.
- combine_mums_error_cutoff - Error-rate cutoff for combineMUMs. Default .10
Depends on:
- Initial population of BSML repository with input genomes.
- Loading of sequence coverage/quality data (optional)