Automated Genome Analysis using MAGPIE



Christoph W. Sensen1a, Paul Gordon1, Philip Denno1 and Terry Gaasterland2b

1 National Research Council of Canada, Institute for Marine Biosciences, 1411 Oxford Street, Halifax, NS, Canada B3H 3Z1

2 The Rockefeller University, 1230 York Avenue, New York, NY 10021-6399 U.S.A.

a e-mail: csensen@ucalgary.ca Web: http://www.cbr.nrc.ca/sensencw

b e-mail: gaasterl@twyla.rockefeller.edu Web: http://www.mcs.anl.gov/home/gaasterl


The world of biology has recently seen dramatic changes. Genomics has revolutionized the way biology is done. To date, more than 50 microbial and more than a dozen eukaryotic genome sequencing projects are under way, with almost 20 complete genomes already publicly available (for a complete list, see http://www.mcs.anl.gov/home/gaasterl/genomes.html). The huge amount of data from genome projects has created a need to automate genome analysis and annotation. We have created an automated genome analysis and annotation system called MAGPIE (the Multipurpose Automated Genome Project Investigation Environment). It integrates database search tools (e.g. BLAST and FASTA) with other analyses and prepares HTML-based reports that summarize genomic features. Because some genomic databases are updated nightly, databases can be searched frequently, and analyses can be redone whenever new information for a part of the genome becomes available.

MAGPIE was initially developed for the analysis of the genome of the archaeon Sulfolobus solfataricus P2. (More information about the MAGPIE system is available at http://www.mcs.anl.gov/home/gaasterl/magpie.html. A demonstration of the MAGPIE Sulfolobus analysis is available at http://niji.imb.nrc.ca/sulfolobus. More information about the different MAGPIE views is available from http://niji.imb.nrc.ca/sulfhome/genann.html.) Today, the system is used for the analysis and annotation of more than 20 genomes. MAGPIE is a portable system that can be installed on virtually any UNIX platform. Aside from SICStus Prolog, it uses mostly freely available components, e.g. Perl 5, the GD.pm library and the Apache Web server. MAGPIE has been installed successfully on SunOS 4, Sun Solaris, Sun Solaris x86 and DEC OSF/1 (Alpha UNIX). Other platforms are likely to work as well.

Several features set MAGPIE apart from other automated genome analysis systems. All aspects of MAGPIE are fully configurable through the modification of ASCII files. Configurable items include, among others:

All MAGPIE evidence is categorized by confidence level. Level one evidence indicates a strong relationship (e.g. similarity) between a database entry and a certain region of the genome studied. Level two evidence is still quite strong, but not strong enough to call the region a homologue of a database entry. Level three evidence is better than background noise, but it is normally not good enough to make a final function call for that particular region of the genome.
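The three-level scheme can be illustrated with a minimal Python sketch. The numeric cut-offs and the function name below are assumptions made purely for illustration; the actual MAGPIE thresholds are set in its configuration files and are not given in this text.

```python
from typing import Dict, Optional

# Hypothetical score cut-offs, one per confidence level.
# These numbers are invented for this sketch and are not MAGPIE defaults.
LEVEL_THRESHOLDS: Dict[int, float] = {1: 200.0, 2: 100.0, 3: 50.0}


def confidence_level(
    score: float, thresholds: Dict[int, float] = LEVEL_THRESHOLDS
) -> Optional[int]:
    """Return the strongest level (1 is best) whose cut-off the score reaches.

    Returns None when the score does not rise above background noise,
    i.e. when it falls below the level three cut-off.
    """
    for level in sorted(thresholds):  # check level 1 first, then 2, then 3
        if score >= thresholds[level]:
            return level
    return None
```

Under these assumed cut-offs, a hit with a score of 120 would be reported as level two evidence, and a hit with a score of 10 would not be reported at all.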

An important feature of MAGPIE is its handling of genomic context. Many genome projects work on a contig-by-contig basis: the genome is broken into smaller pieces, these pieces (called contigs) are sequenced, and they are afterwards merged to reconstruct the entire genome. MAGPIE automates this merging step. It automatically maps the individual contigs, eliminates the overlapping (redundant) regions and maps sequence features into the larger context. MAGPIE also organizes the non-redundant sequence in local databases that can be searched through a Web form.
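The idea of merging overlapping contigs and carrying feature positions into the larger context can be sketched in Python as follows. This is a simplified illustration, not MAGPIE's own procedure: it assumes the contigs are already in left-to-right order, on the same strand, and overlap exactly (no sequencing errors). The function names and the min_overlap parameter are hypothetical.

```python
from typing import List, Sequence, Tuple


def overlap_length(left: str, right: str, min_overlap: int = 3) -> int:
    """Length of the longest suffix of `left` that equals a prefix of `right`.

    Overlaps shorter than `min_overlap` (assumed to be at least 1) are ignored.
    """
    max_len = min(len(left), len(right))
    for length in range(max_len, min_overlap - 1, -1):
        if left[-length:] == right[:length]:
            return length
    return 0


def merge_contigs(
    contigs: Sequence[str], min_overlap: int = 3
) -> Tuple[str, List[int]]:
    """Merge ordered contigs into one non-redundant sequence.

    Returns the merged sequence and, for each contig, the offset at which
    it starts in the merged sequence. A feature at position p of contig i
    then lies at position offsets[i] + p of the merged sequence.
    """
    if not contigs:
        return "", []
    merged = contigs[0]
    offsets = [0]
    for contig in contigs[1:]:
        olap = overlap_length(merged, contig, min_overlap)
        offsets.append(len(merged) - olap)  # contig starts inside the overlap
        merged += contig[olap:]             # append only the new, non-redundant part
    return merged, offsets
```

For example, merge_contigs(["ACGTAC", "TACGGA", "GGATT"]) removes the two three-base overlaps and yields a single sequence together with the start offset of each contig, which is what is needed to translate contig-local feature coordinates into genome coordinates.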

MAGPIE creates a set of tables that contain the genomic features associated with a particular region of the genome. While other genome analysis and annotation systems present these feature tables only in HTML format, MAGPIE also provides an extensive set of graphical overviews of the genome analysis. To our knowledge, no other system offers a similarly comprehensive graphical summary. The graphical summaries can display a degree of complexity that cannot be conveyed by table-based presentations.

Graphical overviews include:

All graphical overviews contain interactive imagemaps that link to more detailed graphical views of the evidence.

MAGPIE indexes the genomic features it identifies. Fast database searching is provided for DNA sequences, protein sequences, DNA and protein motifs, and keywords (searching the entire dataset). Searching all of the evidence for a microbial genome for a keyword like "protease" takes less than two minutes on a SUN workstation. MAGPIE also supports complex queries (e.g. AND, OR, NOT) that allow a simultaneous search for multiple keywords. All search outputs are reformatted into HTML and displayed in the Web browser.
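A combined AND/OR/NOT keyword query of the kind described above could look like the following Python sketch. It is only an illustration of the query logic under simple assumptions (case-insensitive substring matching over free-text descriptions, no index); the function names, parameter names and sample records are hypothetical and do not reflect MAGPIE's query syntax or data.

```python
from typing import Dict, Iterable, List


def matches(
    text: str,
    all_of: Iterable[str] = (),
    any_of: Iterable[str] = (),
    none_of: Iterable[str] = (),
) -> bool:
    """Case-insensitive keyword test: AND (all_of), OR (any_of), NOT (none_of)."""
    lowered = text.lower()
    any_of = list(any_of)
    if any(k.lower() not in lowered for k in all_of):
        return False  # a required keyword is missing
    if any_of and not any(k.lower() in lowered for k in any_of):
        return False  # none of the alternative keywords is present
    if any(k.lower() in lowered for k in none_of):
        return False  # an excluded keyword is present
    return True


def search(records: Dict[str, str], **query: Iterable[str]) -> List[str]:
    """Return the identifiers of all records whose description matches the query."""
    return [rid for rid, text in records.items() if matches(text, **query)]


# Invented sample records, for illustration only.
records = {
    "orf1": "ATP-dependent protease La",
    "orf2": "serine protease inhibitor",
    "orf3": "DNA polymerase",
}
hits = search(records, all_of=["protease"], none_of=["inhibitor"])
```

Here the query "protease AND NOT inhibitor" selects only the first record. A production system would use a prebuilt index instead of scanning every description, which is presumably what makes searches over a whole genome's evidence fast.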

All genes identified through MAGPIE are categorized by phylogenetic origin. Genes fall into one of the following categories: BAE (universal genes, i.e. genes shared between the bacterial, archaeal and eukaryotic phylogenetic domains), BA (bacterial and archaeal only), BE (bacterial and eukaryotic only), AE (archaeal and eukaryotic only), B (bacterial), A (archaeal) or E (eukaryotic). This classification allows the evolution of metabolic pathways or structural features in cells to be studied. The phylogenetic information is also very valuable in the identification of pharmaceutical target genes.
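The category codes listed above follow directly from the set of phylogenetic domains in which a gene has database matches. The Python sketch below shows one way to derive the code; the function name, the domain labels used as input, and the empty-string result for a gene without any matches are assumptions of this example, not part of MAGPIE.

```python
from typing import Iterable

# Domains in the order used by the category codes (B, then A, then E).
DOMAIN_ORDER = (("bacteria", "B"), ("archaea", "A"), ("eukaryota", "E"))


def phylo_category(domains_with_hits: Iterable[str]) -> str:
    """Build the category code (e.g. "BAE", "AE", "B") from the domains
    in which a gene has database matches.

    Returns an empty string when there are no matches at all
    (an assumption of this sketch).
    """
    domains = set(domains_with_hits)
    unknown = domains - {name for name, _ in DOMAIN_ORDER}
    if unknown:
        raise ValueError(f"unknown domain(s): {sorted(unknown)}")
    return "".join(code for name, code in DOMAIN_ORDER if name in domains)
```

A gene with matches only in archaeal and eukaryotic sequences would thus be labelled AE, and a gene with matches in all three domains would be labelled BAE.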

MAGPIE includes a form-based annotation system that allows experts to override the results of the automated genome analysis. Through this system, verification results can be added to the knowledge database and conflicting evidence can be resolved. This is crucial before publication or database submission to ensure that the information distributed is as accurate as possible. No fully automated genome analysis and annotation system can resolve conflicts that arise from errors contained in databases.

MAGPIE has been studied extensively by Alexander Sczyrba in the group of Robert Giegerich at the University of Bielefeld, who compared several genome analysis systems for his Diplomarbeit (similar to a master's thesis). The thesis is available in German only. We are linking to an Adobe PDF file of the thesis.


Further information about the MAGPIE system and its availability can be obtained from:

Terry Gaasterland: email gaasterl@twyla.rockefeller.edu

Or Christoph Sensen: email sensencw@niji.imb.nrc.ca


06.07.1999