Technical overview
The Sybil software package provides a primarily web-based front-end to comparative genome datasets
warehoused in a chado relational database. It was developed by the
bioinformatics department at The Institute for Genomic Research (TIGR), and development continues at the
J. Craig Venter Institute (JCVI) and the Institute for Genome Sciences (IGS) at the University of Maryland, Baltimore.
Sybil has been used at TIGR/JCVI, IGS, NYU, New York Medical College, Novartis Vaccines, and the University
of Maryland, College Park to support a number of research projects that
involve comparative genome analysis. The following sections provide some high-level technical details about
the overall architecture and external dependencies of the Sybil package.
Sybil system architecture
Sybil is implemented primarily in Perl and makes use of the same 3-tiered software framework ("Coati") as
Manatee, a related TIGR project. In Coati a set of common Perl
modules handle routine tasks such as logging, URL management, low-level database access, and query result caching.
Its 3-tiered architecture decouples the presentation layer (the "top" tier, consisting of a set of Perl CGI scripts)
from the database access layer (the "bottom" tier, parts of which may depend on the database schema and/or vendor.)
This is done by way of a schema-independent middle tier that presents a uniform API to scripts in the
presentation layer. Sybil currently supports only a single database
schema--the chado schema--but has support
for both Sybase and PostgreSQL via Perl DBI and one or more optional database-dependent modules in the database
access layer. On system startup a factory object is used to create an instance of the correct database and
vendor-specific Perl module (which may be a trivial subclass of a shared, vendor-independent module in the
case that none of the SQL queries need to be customized for that particular DBMS.)
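As a rough illustration of this pattern (a minimal sketch only; the package names below are hypothetical and do not
correspond to the actual Coati/Sybil modules), such a factory might look like this in Perl:

    # Minimal sketch of a vendor-dispatching factory (hypothetical package
    # names; not the actual Coati/Sybil code).
    package SybilDB::Factory;
    use strict;
    use warnings;

    # Map a DBMS vendor name to the module holding its customized SQL (if any).
    my %VENDOR_MODULE = (
        Sybase   => 'SybilDB::Chado::Sybase',
        Postgres => 'SybilDB::Chado::Postgres',
    );

    sub create {
        my ($class, %args) = @_;    # e.g., vendor, database, user, password
        my $module = $VENDOR_MODULE{ $args{vendor} }
            or die "Unsupported DBMS vendor: $args{vendor}";
        eval "require $module" or die $@;    # load the vendor-specific subclass
        return $module->new(%args);          # exposes the schema-independent API
    }

    1;

Scripts in the upper tiers would then call such a factory once at startup and work only with the returned
object's vendor-independent methods.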
The Perl CGI scripts in the presentation layer make use of HTML::Template
and a set of template files to handle result formatting (e.g., as HTML or tab-delimited plain text) wherever possible. This prevents
the Perl CGI scripts from becoming cluttered with hard-coded HTML tags and allows the system to support
new display formats simply by adding the appropriate set of template files. Similarly, URLs (including
those of the Perl CGI scripts themselves) are stored in a separate set of XML files and are referenced
only by logical name. This ensures that each distinct URL appears only once in a centralized location,
greatly simplifying maintenance.
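To illustrate (a minimal sketch only; the template file name and variable names are assumptions, not taken from
Sybil), a presentation-layer CGI script using HTML::Template typically looks something like this:

    #!/usr/bin/perl
    # Minimal HTML::Template sketch (hypothetical template and variable names).
    use strict;
    use warnings;
    use HTML::Template;

    # Pick an HTML or tab-delimited template depending on the requested format.
    my $format   = 'html';    # or 'tab'
    my $template = HTML::Template->new(filename => "gene_list.$format.tmpl");

    # The script supplies only data; all markup lives in the template file.
    $template->param(
        page_title => 'Gene list',
        genes      => [ { name => 'PF14_0001', length => 1234 } ],
    );
    print "Content-type: text/html\n\n", $template->output;

Supporting a new output format then amounts to writing a new set of template files rather than modifying the
Perl code.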
Additional modules specific to Sybil handle generating most of the
graphical displays supported by the system. All of the graphical
display scripts can generate output as either vector (SVG) or bitmapped (e.g., PNG, JPEG) graphics. In
some cases SVG support is provided by GD::SVG
and in others it is implemented directly as an alternate
rendering method to GD. Some displays can be configured to generate higher-quality bitmapped images by
rendering first to SVG and then using the batik-rasterizer tool from the Apache Batik
package to produce a PNG or JPEG image. PDF format figures for publication are also generated by converting
SVG files to PDF with batik-rasterizer.
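As an illustration of the dual rendering approach (a sketch only, assuming the GD and GD::SVG modules are
installed; this is not code from the Sybil display modules), the same drawing routine can be pointed at either
backend:

    # Sketch of dual GD / GD::SVG rendering (illustrative only).
    use strict;
    use warnings;
    use GD;
    use GD::SVG;

    sub draw_gene_box {
        my ($class) = @_;    # 'GD::Image' or 'GD::SVG::Image'
        my $img   = $class->new(400, 50);
        my $white = $img->colorAllocate(255, 255, 255);   # first color = background
        my $blue  = $img->colorAllocate(0, 0, 200);
        $img->filledRectangle(20, 15, 380, 35, $blue);    # a single "gene" glyph
        return $img;
    }

    # Bitmapped (PNG) output via GD...
    open my $png_fh, '>', 'gene.png' or die $!;
    binmode $png_fh;
    print {$png_fh} draw_gene_box('GD::Image')->png;

    # ...or vector (SVG) output via the drop-in GD::SVG interface.
    open my $svg_fh, '>', 'gene.svg' or die $!;
    print {$svg_fh} draw_gene_box('GD::SVG::Image')->svg;

The SVG file can then be handed to batik-rasterizer (with an invocation along the lines of
java -jar batik-rasterizer.jar -m image/png gene.svg, or -m application/pdf for publication figures) to produce
the higher-quality output mentioned above.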
Open source tools leveraged by Sybil
Both Sybil and the other software packages on which it depends (i.e., those used to populate and maintain
the comparative database) leverage the contributions of several groups that have developed
bioinformatics tools and databases under the open source model. Here are some of the open source tools
and databases employed by Sybil and the other software components that we use to generate, store, and
maintain the genome-scale comparative datasets displayed by the system:
- The chado relational database schema, the reference schema for GMOD
- BSML (Bioinformatic Sequence Markup Language)
- The Perl programming language
- Bioperl (particularly Lincoln Stein's Bio::Graphics package)
- The Apache Batik SVG toolkit
- The MUMmer suffix-tree-based suite of alignment tools
Preparing a comparative database for use in Sybil
Before Sybil can be used to view comparative genome data, a chado database containing the comparative data
must be prepared. Here is a rough outline of the steps required to set up a new comparative analysis
resource that can be interrogated via Sybil:
- Create an initially empty (with the exception of some support tables) chado comparative database.
- Reserve space on disk for a "BSML repository" in which all BSML files related to the project will be stored.
- Convert the (typically single-organism) input datasets into BSML and place these files in the repository.
- Load the BSML-formatted data into the database using a DBMS-specific bulk loader (e.g., bcp for Sybase.)
- Run a series of analyses/computes on the data (a schematic sketch of this cycle follows the list). These computes:
- Typically both read from and write to the BSML repository.
- May be run on the entire database (e.g., all-vs-all blastp) or any subset thereof (e.g., single organism protein clustering.)
- Are usually encoded by workflows that are run by the TIGR workflow engine (see below.)
- May depend on the results of other computes (e.g., protein clustering depends on all-vs-all blastp)
- Load the BSML-formatted results of the computes into the database via the bulk loader API.
- Configure the Sybil web-based interface to connect to the appropriate chado database.
- Repeat compute/load steps as needed (e.g., to adjust parameter settings or explore new comparisons.)
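The following is a schematic sketch of the convert/load/compute cycle outlined above; every script name in it is
a hypothetical placeholder for the actual TIGR/IGS tooling, and the paths, database name, and workflow names are
invented for illustration:

    # Schematic driver for the convert/load/compute cycle.  All script names,
    # paths, and workflow names below are hypothetical placeholders.
    use strict;
    use warnings;

    my $repository = '/usr/local/bsml_repository/my_project';

    # 1. Convert each (typically single-organism) input dataset to BSML.
    for my $genome (qw(p_falciparum p_yoelii c_parvum)) {
        system("genbank2bsml.pl --input $genome.gbk --output $repository/$genome.bsml") == 0
            or die "BSML conversion failed for $genome";
    }

    # 2. Bulk-load the BSML-formatted data into the chado database.
    system("bsml2chado.pl --repository $repository --database my_chado_db") == 0
        or die "bulk load failed";

    # 3. Run computes (e.g., all-vs-all blastp, then clustering) as workflows,
    #    then load their BSML results with the same bulk-load step.
    for my $workflow (qw(all_vs_all_blastp.xml jaccard_clustering.xml)) {
        system("run_workflow.pl --workflow $workflow --repository $repository") == 0
            or die "workflow $workflow failed";
    }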
Software components involved in data generation and management
Finally, here are some of the software components that are used in the procedure described above. All of the
following components were developed at TIGR:
- A set of Perl modules for reading, writing, and validating BSML-formatted XML data.
- Scripts for converting "legacy" data into BSML files suitable for import into chado.
- Scripts for converting BSML files into the format understood by the bulk loader utility.
- The TIGR workflow engine, a generic Java-based workflow processing system that:
- Simplifies the execution of complex multi-step procedures with embedded dependencies.
- Automatically handles distributing CPU-intensive tasks onto a multi-node compute grid/cluster.
- Allows halted workflows to be resumed from the point at which failure occurred.
The TIGR workflow engine is also being packaged for public release; see the relevant SourceForge page.
- A set of XML-encoded workflows that handle many of the steps mentioned in the list above.
- Tools for creating new workflows (by chaining together existing subflows) and monitoring their execution progress.
Computes/algorithms
What follows is a partial list of the analyses supported by the current system. Along with each analysis
we list some of the key parameters that can be used to tailor it to a specific application, and also the
steps and/or analyses that must precede it. Many of these analyses are encoded in workflows that can be
run on a distributed compute cluster via the TIGR workflow engine, as discussed above. Some, however, can
be invoked on-demand as part of a complex visualization or web display (e.g., the syntenic blocks analysis.)
All-vs-all BLASTP analysis
A simple all-vs-all BLAST matrix that runs WashU-BLASTP on every pair of proteins in the database (a rough
sketch of the corresponding command line follows the lists below).
Parameters:
- expect - only load GSPs/subjects with a smaller E-value than this
- filter - optionally perform low-complexity filtering (e.g., seg, xnu) on the query sequence
- gspmax - limit number of GSPs/HSPs per subject sequence
- matrix - set scoring matrix (e.g., BLOSUM62)
Depends on:
- Initial population of BSML repository with input genomes/proteomes.
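For orientation, here is a sketch of how these parameters might map onto a WashU-BLASTP command line (the exact
flag spellings are an assumption and the file names are placeholders; this is not the command generated by the
Sybil workflows):

    # Rough sketch of a WashU-BLASTP invocation (flag spellings are assumptions;
    # database and query file names are placeholders).
    use strict;
    use warnings;

    my $cmd = join ' ',
        'blastp', 'all_proteins.pep', 'query.pep',    # database, then query
        'E=1e-5',                                     # expect cutoff
        'filter=seg',                                 # low-complexity filtering
        'gspmax=20',                                  # GSPs per subject sequence
        'matrix=BLOSUM62';                            # scoring matrix
    system($cmd) == 0 or die "blastp failed: $?";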
Jaccard-clustering analysis
A protein clustering analysis that is typically used to group together highly similar proteins within
a single genome/organism of interest. In the sample database, for example, this analysis has been
run three times, once for each of the organisms (P. falciparum, P. yoelii, and
C. parvum). The Jaccard clustering algorithm relies on two parameters:
- a Jaccard coefficient threshold between 0.0 and 1.0
- a minimum BLASTP percent identity threshold between 0% and 100%
Typically we have used a Jaccard coefficient threshold of 0.6 and a minimum BLASTP
percent identity score of 80%; these values were chosen based on some
small empirical studies. Given these two parameters we first make a graph in which the
nodes of the graph are the proteins in the genome undergoing Jaccard clustering
and the edges of the graph are all of the all-vs-all BLASTP matches
with 80% identity or better. For any two proteins (nodes) A and B in
the graph we define the Jaccard similarity coefficient as the
number of nodes (other proteins) that are connected directly to both A AND B,
divided by the number of nodes that are connected directly to either A OR B.
A new graph is constructed with the same set of nodes, but where an
edge is drawn between two proteins if and only if the Jaccard
similarity coefficient for those two proteins is greater than or equal to the chosen
threshold (i.e., 0.6). Intuitively, the Jaccard similarity
coefficient is essentially measuring how similarly-connected A and B are. An
obvious extension to the algorithm would be to take into account the percentage
of the protein sequences covered by the BLASTP matches, in addition to just
looking at the percent identity. There are situations, however, in which it
is advantageous to allow proteins of significantly different lengths to be
clustered together (e.g., when looking for gene annotation problems in an unfinished
genome based on comparisons to one or more finished and manually-curated genomes.)
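A minimal sketch of the coefficient calculation just described (illustrative Perl, not the Sybil implementation;
the %$neighbors structure is an assumed representation of the BLASTP graph):

    # Jaccard similarity coefficient over a BLASTP match graph (sketch only).
    # $neighbors->{$protein} is a hash-set of the proteins that $protein matches
    # at or above the percent-identity cutoff, e.g. $neighbors->{'A'}{'C'} = 1.
    use strict;
    use warnings;

    sub jaccard_coefficient {
        my ($neighbors, $a, $b) = @_;
        my %union  = (%{ $neighbors->{$a} }, %{ $neighbors->{$b} });
        my $shared = grep { $neighbors->{$b}{$_} } keys %{ $neighbors->{$a} };
        return 0 unless keys %union;
        return $shared / scalar(keys %union);
    }

    # An edge is drawn in the second graph iff the coefficient meets the
    # threshold (0.6 in the typical configuration described above).
    sub jaccard_linked {
        my ($neighbors, $a, $b, $threshold) = @_;
        return jaccard_coefficient($neighbors, $a, $b) >= $threshold;
    }

In the usual formulation the clusters are then the connected components of this second graph.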
Parameters:
- linkscore - Jaccard coefficient threshold between 0 and 1 (see above)
- pidentity_cutoff - Minimum BLASTP percent identity (see above)
- pvalcut - Maximum BLASTP p-value threshold to apply when creating the initial graph
- max_multi_alignment - Limit computation of clustalw alignments (a post-processing step) to clusters of this size or smaller
Depends on:
- The all-vs-all BLASTP analysis.
Sybil COG analysis
In contrast with the Jaccard clustering analysis, this protein clustering algorithm is
typically used to cluster proteins from distinct genomes/organisms in order to
identify orthologous genes. In practice the algorithm is a simple bidirectional best
BLASTP hit analysis; it groups together any pair of proteins for which each is the other's
best BLASTP hit (in their respective genomes.) Unfortunately it is easily confounded by
situations in which one or more of the genomes contain families of closely-related
paralogs. The algorithm will have difficulty resolving the differences between the
paralogs, meaning that some of them may not be clustered, and those that are clustered
successfully will not necessarily all end up in the same cluster.
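A minimal sketch of the bidirectional-best-hit test (illustrative Perl only; the %$best_hit structure is an
assumed representation of the BLASTP results):

    # Bidirectional best BLASTP hit test (sketch only, not the Sybil code).
    # $best_hit->{$protein}{$other_genome} holds the identifier of the
    # best-scoring hit of $protein among the proteins of $other_genome.
    use strict;
    use warnings;

    sub is_bidirectional_best_hit {
        my ($best_hit, $prot_a, $genome_a, $prot_b, $genome_b) = @_;
        return ($best_hit->{$prot_a}{$genome_b} // '') eq $prot_b
            && ($best_hit->{$prot_b}{$genome_a} // '') eq $prot_a;
    }

Any pair of proteins passing this test is placed in the same cluster.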
Parameters:
- pvalcut - Maximum BLASTP p-value to use when looking for bidirectional best BLASTP matches.
- max_multi_alignment - Limit computation of clustalw alignments (a post-processing step) to clusters of this size or smaller
Depends on:
- The all-vs-all BLASTP analysis.
Jaccard-filtered COG analysis
Run a bidirectional best BLASTP hit clustering algorithm (the Sybil COG analysis) on all of the genomes, taking into account the Jaccard clusters. The algorithm only considers pairs of proteins from different genomes, so let's say that the algorithm is looking at protein A1 from genome A and protein B1 from genome B. Let's also say that A1 belongs to Jaccard cluster JA1 and B1 belongs to Jaccard cluster JB1. Then A1 and B1 will be placed into the same Jaccard-filtered COG if and only if the following two conditions are met:
- The best outgoing BLASTP edge from JA1 "lands in" (i.e. hits a protein in) JB1.
- The best outgoing BLASTP edge from JB1 "lands in" (i.e. hits a protein in) JA1.
A transitive closure is done using these rules so that, for example, one can end up with a Jaccard-filtered COG with representatives (Jaccard clusters) from all of the genomes, or even multiple Jaccard clusters from the same genome. Note that the phrase "best outgoing BLASTP edge" refers to the highest-scoring BLASTP hit whose query sequence resides in Jaccard cluster JA1 and whose subject/target sequence resides in Jaccard cluster JB1. We are currently using the BLASTP bit score to determine which hit is "best", and ties are broken in an arbitrary (though hopefully deterministic) fashion, e.g., the first hit to appear in the BLASTP output is taken. In most common situations such boundary conditions will not be applicable, since the Jaccard clustering phase should collect similar sequences into the same Jaccard cluster, meaning that exactly which of them has the "best" BLASTP match will be irrelevant.
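A sketch of the linking rule (illustrative Perl; the %$best_edge structure is an assumption about how the best
outgoing edges might be stored):

    # Jaccard-filtered COG linking rule (sketch only, not the Sybil code).
    # $best_edge->{$cluster}{$other_genome} holds the Jaccard cluster in
    # $other_genome that receives the best outgoing BLASTP edge (by bit score).
    use strict;
    use warnings;

    sub clusters_linked {
        my ($best_edge, $ja1, $genome_a, $jb1, $genome_b) = @_;
        return ($best_edge->{$ja1}{$genome_b} // '') eq $jb1
            && ($best_edge->{$jb1}{$genome_a} // '') eq $ja1;
    }

    # Linked cluster pairs are then merged by transitive closure (e.g., with a
    # union-find pass) to produce the final Jaccard-filtered COGs.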
Parameters:
- jaccard_dir - Specifies the Jaccard clusters to use for the first step of the clustering process.
- pvalcut - Maximum BLASTP p-value to use when looking for bidirectional best BLASTP matches.
- max_multi_alignment - Limit computation of clustalw alignments (a post-processing step) to clusters of this size or smaller
Depends on:
- The all-vs-all BLASTP analysis.
- The Jaccard-clustering analysis.
PROmer/NUCmer analysis
PROmer and NUCmer are two of the tools available in the MUMmer
suite of sequence alignment tools. Both PROmer and NUCmer perform fast pairwise local sequence alignments using
a suffix tree data structure. Both also use genomic sequences as input, but PROmer generates alignments based on
conceptual 6-frame protein translations of the input sequences and NUCmer generates alignments using the
untranslated DNA sequences.
Parameters: (see the PROmer and NUCmer documentation for detailed descriptions of these; a sketch of a typical invocation follows the lists below)
- breaklen
- diagfactor
- masklen
- maxgap
- mincluster
- minmatch
Depends on:
- Initial population of BSML repository with input genomes.
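For reference, here is a sketch of how these parameters might appear on a PROmer command line driven from Perl
(the long-option spellings and the values shown are assumptions; consult the PROmer documentation for the
authoritative list):

    # Sketch of a PROmer invocation (option spellings and values are assumptions;
    # file names are placeholders).
    use strict;
    use warnings;

    my @cmd = (
        'promer',
        '--prefix',     'ref_vs_qry',
        '--breaklen',   60,
        '--diagfactor', 0.11,
        '--masklen',    8,
        '--maxgap',     30,
        '--mincluster', 20,
        '--minmatch',   6,
        'reference.fasta', 'query.fasta',
    );
    system(@cmd) == 0 or die "promer failed: $?";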
Position effect analysis
Position effect is an algorithm
that uses pairwise protein matches to identify distinct sequence
segments in which the gene order and orientation are
conserved. Conservation of gene order is identified by calculating an
optimal alignment between ordered sets of genes within the sequence
segments. The ordered sets of genes are obtained by first identifying
putative homologs between proteins contained within two sequence
segments, usually using the results of BLASTP. The homologous
proteins of each match pair serve as anchor points, and a window of
adjacent proteins upstream and downstream of each anchor point is taken.
The window size is a configurable parameter, and the genes within each
window define an ordered set. Pairwise protein matches between the
ordered sets, from a matching algorithm such as BLASTP, are used to
generate an optimal alignment between the ordered sets. The optimal
alignment uses a scoring function that relies on scores for matches
between pairs of proteins in the ordered sets. This pairwise match
score is usually the percent similarity score from BLASTP but it is
configurable to use other scores. This algorithm has been utilized in
the following publications:
Carlton JM, Angiuoli SV, Suh BB, Kooij TW, Pertea M, Silva JC, Ermolaeva MD, Allen JE, Selengut JD, Koo HL, et al.: Genome sequence and comparative analysis of the model rodent malaria parasite Plasmodium yoelii yoelii. Nature 2002, 419:512-519.
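The alignment of ordered gene sets described above can be pictured as a standard global alignment over genes
rather than residues. Below is a minimal sketch (illustrative Perl, not the Sybil implementation) in which the
$score callback is assumed to return the pairwise match score (e.g., BLASTP percent similarity) or the
orientation penalty for out-of-orientation matches:

    # Needleman-Wunsch-style alignment of two ordered gene sets (sketch only).
    use strict;
    use warnings;

    sub align_ordered_sets {
        my ($set_a, $set_b, $score, $gap_penalty) = @_;
        my @m;
        my ($n_a, $n_b) = (scalar @$set_a, scalar @$set_b);

        # Initialize the dynamic-programming matrix with gap penalties.
        $m[$_][0] = $_ * $gap_penalty for 0 .. $n_a;
        $m[0][$_] = $_ * $gap_penalty for 0 .. $n_b;

        for my $i (1 .. $n_a) {
            for my $j (1 .. $n_b) {
                my $diag = $m[$i - 1][$j - 1]
                         + $score->($set_a->[$i - 1], $set_b->[$j - 1]);
                my $up   = $m[$i - 1][$j] + $gap_penalty;
                my $left = $m[$i][$j - 1] + $gap_penalty;
                $m[$i][$j] = ($diag >= $up && $diag >= $left) ? $diag
                           : ($up >= $left)                   ? $up
                           :                                    $left;
            }
        }
        return $m[$n_a][$n_b];    # optimal alignment score between the two sets
    }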
Parameters:
- window_size - Number of adjacent genes to use when defining each ordered set. Default 10
- gap_penalty - Penalty for non-matching genes in an alignment of ordered sets. Default -30.
- gene_count_cutoff - Minimum number of genes in an ordered set. Ordered sets below this cutoff are ignored. Default 2.
- gene_length_cutoff - Minimum length of genes in bp. Default 0.
- min_matches_per_window - Minimum number of matching genes per ordered set. Ordered sets below this cutoff are ignored. Default 2.
- orientation_penalty - Penalty for matches that occur "out of orientation" with respect to the orientation of the ordered set. Default -100.
- rankfield - Score used to determine ranking of pairwise match scores when multiple pairwise scores are available. Default BLASTP p-value.
- ranktype - Not currently implemented.
- scorefield - The pairwise match score used to quantify matches between genes in an alignment of ordered sets. Default BLASTP percent similarity
Depends on:
- Any type of protein-protein matching algorithm, including all-vs-all BLASTP and the protein clustering analyses.
Syntenic block analysis
This analysis identifies regions of conserved synteny (syntenic blocks) using Jaccard-filtered COGs (J-fCOGs) or any other protein-clustering or protein-matching algorithm.
In this algorithm a match is defined as a pair of genes that belong to the same J-fCOG, such that one gene is located on a designated reference sequence (e.g., P. falciparum chromosome 14) and the other is located on a [different] query sequence. A "syntenic block" is a set of matches in which all the reference genes are located on the same reference sequence and all the query genes are located on the same query sequence. The "extent" (or location) of a syntenic block on one of the two sequences is defined as the smallest sequence interval that completely contains all of the genes that both: a) appear in one of the block's matches and b) are located on that sequence.
For each reference sequence we consider all possible query sequences that have 2 or more matches with the reference. An initially empty list of partial syntenic blocks is created and the matches are sorted according to the positions of the reference genes, ignoring their orientation. The algorithm then iterates over the ordered matches and for each match the current list of partial syntenic blocks is searched to determine whether 0, 1, or 2 blocks are "near" the match (see below). If no blocks are near the match then the match is used to create a new partial syntenic block. If a single partial syntenic block is near the match then the match is added to that block. If two partial syntenic blocks are near the match (i.e., one on either side) then the two partial blocks are merged and the match is added to the newly-merged block.
Once all the matches have been examined, the resulting syntenic blocks for the reference sequence are exactly those partial blocks that contain min_block_size_genes (see below) or more distinct genes and whose extent is at least min_block_size_kb (see below) kb, on both the reference and the target sequence.
A match is considered to be "near" a partial syntenic block if these conditions hold:
- the genes in the match and the partial syntenic block reside on the same pair of sequences
- there are <= max_gap_genes genes between the match's reference gene and the 3' end of the partial block's extent on the reference sequence
- there are <= max_gap_genes genes between the match's query gene and the 3' end of the partial block's extent on the query sequence
An exception is made for clusters of tandemly-repeated matching genes, which are counted as a single matching gene for the purposes of this calculation. The orientation of each partial block (on any given sequence) is determined by examining all of the matching genes added to the block thus far and counting how many were to the 5' of the previous matching gene, versus how many were to the 3' of the previous matching gene. Blocks with more 5' additions are considered to be in the reverse orientation (with respect to that sequence.)
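A condensed sketch of the block-building loop (illustrative Perl only; the match representation is an assumption,
the "near" test is simplified to gene-index distance from the block's most recently added match, and
tandem-repeat collapsing and the final size filters are omitted):

    # Greedy construction of partial syntenic blocks from matches that are
    # pre-sorted by reference-gene position (sketch only).
    use strict;
    use warnings;

    # Simplified "near" test: same query sequence, and within max_gap_genes of
    # the block's most recently added match on both sequences.
    sub near_block {
        my ($block, $match, $max_gap_genes) = @_;
        my $last = $block->[-1];
        return $last->{qry_seq} eq $match->{qry_seq}
            && abs($match->{ref_idx} - $last->{ref_idx}) <= $max_gap_genes + 1
            && abs($match->{qry_idx} - $last->{qry_idx}) <= $max_gap_genes + 1;
    }

    sub build_blocks {
        my ($matches, $max_gap_genes) = @_;
        my @blocks;    # each partial block is a list of matches
        for my $match (@$matches) {
            my @near = grep { near_block($_, $match, $max_gap_genes) } @blocks;
            if (@near == 0) {
                push @blocks, [$match];              # start a new partial block
            }
            elsif (@near == 1) {
                push @{ $near[0] }, $match;          # extend the nearby block
            }
            else {                                   # merge the two flanking blocks
                my ($a, $b) = @near[0, 1];
                push @$a, @$b, $match;
                @blocks = grep { $_ != $b } @blocks;
            }
        }
        return @blocks;    # caller applies min_block_size_genes / min_block_size_kb
    }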
Parameters:
- refseq - reference sequence
- target_organisms - optional list of target organisms
- target_seqs - optional list of target sequences
- min_block_size_kb - minimum syntenic block size in kilobases
- min_block_size_genes - minimum syntenic block size in number of genes
- max_gap_genes - maximum number of adjacent nonmatching genes that may appear in a block
Note that most of the above parameters can be set to different values for the reference and target
genomes, in order to account for differences between the two, if so desired.
Depends on:
- A protein-clustering or protein-matching analysis (e.g., the Jaccard-filtered COG analysis).
SNP-discovery analysis
The current SNP-discovery analysis is a two-step process. In the first
step, any sequence contigs or fragments from a partially-finished genome
are tiled along a reference sequence using NUCmer from the MUMmer 3.0
software package. In the second step, MUMmer is run to identify exact sequence
matches between the tiled sequence and the reference sequence. These
exact matches are masked out during a post-processing step to identify
a set of polymorphisms between the two sequences. Further processing
of the putative polymorphisms is done to identify likely SNPs. These
SNPs are then loaded into the database where they can be viewed in the
context of sequence coverage and/or quality data in Sybil.
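A rough sketch of how the two steps might be driven from Perl (the MUMmer invocations shown are assumptions, the
file names are placeholders, and the polymorphism post-processing is omitted; the real pipeline is
workflow-driven):

    # Two-step SNP-discovery sketch (illustrative only).
    use strict;
    use warnings;

    # Step 1: tile draft contigs along the reference with NUCmer + show-tiling.
    system('nucmer', '-p', 'tiling', 'reference.fasta', 'contigs.fasta') == 0
        or die 'nucmer failed';
    system('show-tiling tiling.delta > contigs.tiling') == 0
        or die 'show-tiling failed';

    # Step 2: find exact matches between the tiled sequence and the reference.
    # Post-processing then masks these exact matches and examines what remains
    # for candidate polymorphisms/SNPs.
    system('mummer -mum -l 20 reference.fasta tiled_contigs.fasta > exact.mums') == 0
        or die 'mummer failed';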
Parameters:
- tiling_coverage_threshold - Minimum percent of sequence coverage for tiling fragments. Default 80%.
- tiling_maximum_gap - Maximum gap between tiling fragments. Default -1.
- mummer_minimum_match_length - Minimum MUM length for MUMmer. Default 20.
- mgaps_minimum_cluster_match_length - Minimum length of cluster match for mgaps program in NUCmer package. Default 100.
- mgaps_maximum_seperation_between_clusters - Maximum separation between matches in cluster for mgaps. Default 600.
- mgaps_fractional_diagonal_distance - Fraction of separation for diagonal difference for mgaps. Default .12.
- combine_mums_error_cutoff - Error-rate cutoff for combineMUMs. Default .10
Depends on:
- Initial population of BSML repository with input genomes.
- Loading of sequence coverage/quality data (optional)