XML for Molecular Biology as compiled by Paul Gordon
For my past and present research work, please see my homepage.
My most recent work on XML in bioinformatics integration is
a Moby Web Services creation tool called Daggoo.
I've collected here a list of XML resources that may be of use to
the bioinformatician. If you have anything to add, or any
comments/corrections, please don't hesitate to
contact me (gordonp@ucalgary.ca).
Some XML definitions, mostly with implementations
Note that local copies of DTDs are not necessarily the most recent versions.
Sequence/Annotation
- AGAVE - An Architecture for Genomic Annotation, Visualization and Exchange from the AGAVE Community led by DoubleTwist.
- BIOML - Proteometrics' attempt to describe nucleotide and peptide sequences. They also have a nice browser.
- Bioseq-set - Output format for the XML option in readseq
- BSML (data only or full) - LabBook.com's attempt to describe all sorts of molbio information. The latter version contains presentation elements for their browser.
- chadoxml, a yet-to-be-released XML version of the forthcoming Generic Model Organism Database Consortium's database schema.
- clone-annotation - Ole Bents' (formerly of MIPS, now at SD&M) specification of clone annotation elements
- DAS - The one with the strange acronym: the Distributed Sequence Annotation System, a sort of Napster for annotations. At this point it has several browsers. It also has several DTDs: for data source documents (DSN), sequence entry points (SEP), DNA data (DNA), sequence and annotation resolving (RES), annotation types (TYPES), features (GFF), and style (STYLE).
- DDBJ-XML - a format for the DNA Data Bank of Japan resources, complemented by WSDL and SOAP transaction features at the same Web site.
- GAME - UC Berkeley-derived series of Genome Annotation Markup Elements DTDs
- MaxML - A RIKEN-developed format for mouse annotation
- RNAML - born of widespread community collaboration, it is a language to describe the structure and sequence of RNA molecules.
- TIGRXML - the format in which TIGR annotations are distributed, e.g. Arabidopsis.
- XFF - An Extensible Feature Format proposed by Thomas Down.
Protein Specific
- abML - an antibody description language
- InterPro - An integrated documentation resource for protein families, domains and functional sites is available in XML format.
- PSAML - description language for protein secondary structure and relationships
- ProteinDatabase - The PIR database is downloadable in XML format, as are individual entries when that format is selected from their search page.
- PROXIML - The PROtein eXtensIble Markup Language by Douglas C. McArthur, which is still in the works at the moment.
- SP-ML - An XML representation of SwissProt records
Analysis
Physical
Expression (old, many supplanted by GEML)
Taxonomy/Ontology
Miscellaneous
- ArticleSet - PubMed now offers XML-formatted versions of medical abstracts.
- The BioMOBY project, which deals with lightweight data interoperation and
service discovery in bioinformatics, is developing XML Schema-based objects for interchange.
- BioNOME - The San Diego Supercomputer Center's two simple DTDs (lookup and dataset) for the bio-computational models and observational data they are trying to capture.
- GPML - GENIA Project Markup Language, used to annotate the admirable GENIA MEDLINE abstract corpus (corpus = collection of writings used for linguistic analysis).
- A company called LabBook has an XML-based lab book and genomic data
suite. Bob Rumpf has updated me, and says that the DTD is "available for download free of charge, as
well as a Genomic Viewer which takes Genbank files, converts them to XML,
and visualizes them as an interactive unit. The Viewer is a freeware
version of our commercial package, the Genomic Browser, which has many
significant enhancements and adds functionality not found in the Viewer".
- LinkOut - specification format for the NCBI's LinkOut provider services
- NCBI now in fact has a whole slew of DTDs generated from ASN.1 specs for MMDB, Medline, etc.
- XDF - A scientific extensible data format which is being
used as the basis for other XMLs, and has an API in several languages. Key features are hierarchical data storage and
easy wrapping of tagged, multi-dimensional data.
- XMLMARC, a Stanford project encoding MARC bibliographic records as XML.
Has an open source API.
- XSIL - Folks at CACR CalTech have made a dubiously (IMHO) useful
Extensible Scientific Interchange Language to store hierarchical
data. Compare with GXD - 'a "lingua franca" for scientific information'.
Generic
Viewers
This section needs a lot of work. Please mail me with your findings!
- Omnigene, a Java Ensembl viewer being developed with the people at Whitehead; DASified.
- Bluejay: coming soon, a generic XML graphical browser
tailored to plotting data on a linear scale (e.g. DNA)
XML data on the Web
- Kaiser Yang has a great site showing XML
transformations between biological markup languages, something we're doing in Bluejay too.
- XEMBL, an EBI project to make their data available in XML format. At last check this included BSML and AGAVE.
- Proteometrics' Proteome Template Library
- Alan Robinson's XML Workshop for biological data
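As a rough illustration of the kind of transformation between biological markup languages mentioned above, here is a minimal Python sketch using only the standard library. The element and attribute names on both sides (sequence, residues, feature, entry, seq, annotation) are hypothetical, invented for this example; they do not follow BSML, AGAVE, or any other DTD listed on this page.

```python
import xml.etree.ElementTree as ET

# Hypothetical source record; these element names are made up for illustration
# and do not correspond to any real DTD.
SOURCE = """<sequence id="seq1" type="dna">
  <residues>ATGGCC</residues>
  <feature name="cds" start="1" end="6"/>
</sequence>"""


def convert(source_xml):
    """Map the hypothetical <sequence> record onto a hypothetical <entry> layout."""
    src = ET.fromstring(source_xml)

    # Carry the identifier over as an attribute of the new root element.
    entry = ET.Element("entry", {"accession": src.get("id", "")})

    # Copy the residues, trimming any surrounding whitespace.
    seq = ET.SubElement(entry, "seq", {"molecule": src.get("type", "")})
    seq.text = (src.findtext("residues") or "").strip()

    # Rename each feature and its coordinates to the target vocabulary.
    annotations = ET.SubElement(entry, "annotations")
    for feat in src.findall("feature"):
        ET.SubElement(annotations, "annotation", {
            "label": feat.get("name", ""),
            "from": feat.get("start", ""),
            "to": feat.get("end", ""),
        })
    return entry


converted = convert(SOURCE)
print(ET.tostring(converted, encoding="unicode"))
```

A real conversion would more likely be written against the published DTDs, or as an XSLT stylesheet; the sketch only shows the element-by-element mapping idea.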
XML development tools
The Perl and Biology XML community
XML reference materials (local)
Please note that it is ill-advised to base software on W3C reports
until they are recommendations. So far there are five.
Paul Gordon