Final thesis

An XML-based Database of Molecular Pathways

by David Hall

LITH-IDA-EX--05/051--SE
2005-06-02

Supervisor: Dr. Lena Strömbäck, Dept. of Computer and Information Science at Linköpings universitet
Examiner: Dr. Lena Strömbäck, Dept. of Computer and Information Science at Linköpings universitet
Division, Department: ADIT, Dept. of Computer and Information Science, 581 83 LINKÖPING
Date: 2005-06-02
Language: English
Report category: Examensarbete (final thesis)
ISBN: —
ISRN: LITH-IDA-EX--05/051--SE
Title of series, numbering; ISSN: —
URL for electronic version: http://www.ep.liu.se/exjobb/ida/2005/dd-d/051/
Title (Swedish): En XML-baserad databas för molekylära reaktioner
Title: An XML-based Database of Molecular Pathways
Author: David Hall
Keywords: XML, native XML databases, XQuery, protein-protein interactions, pathway search
Abstract
Research on protein-protein interactions produces vast quantities of data, and there exist a large number of databases with data from this research. Many of these databases offer the data for download on the web in a number of different formats, many of them xml-based.
With the arrival of these xml-based formats, and especially the standardized formats such as psi-mi, sbml and Biopax, there is a need for searching in data represented in xml. We wanted to investigate the capabilities of xml query tools when it comes to searching in this data. Due to the large datasets we concentrated on native xml database systems, which in addition to searching in xml data also offer storage and indexing specially suited for xml documents.
A number of queries were tested on data exported from the databases IntAct and Reactome using the XQuery language. Both simple and advanced queries were performed. The simpler queries consisted of tasks such as listing information on a specified protein or counting the number of reactions.
One central issue with protein-protein interactions is to find pathways, i.e. series of interconnected chemical reactions between proteins. This problem involves graph searches, and since we suspected that the complex queries it required would be slow, we also developed a C++ program using a graph toolkit.
The simpler queries were performed relatively fast. Pathway searches in the native xml databases took a long time even for short searches, while the C++ program achieved much faster pathway searches.
Keywords: XML, native XML databases, XQuery, protein-protein interactions, pathway search
Acknowledgements
I would like to thank the iislab group at ida for the opportunity to do
this thesis work, for welcoming me and giving an insight into the academic
world. I especially would like to thank my supervisor and examiner Lena
Strömbäck for her guidance and constructive comments throughout the
work.
I would also like to thank my co-worker Anders Bovin, who did a related thesis work at the same time as me. Discussing problems and alternative solutions, as well as things not related at all to the subject of bioinformatics and databases, has been much appreciated.
Finally I would like to thank my opponent Mikael Albertsson for his
valuable feedback.
Contents

1 Introduction
  1.1 Background
  1.2 Problem overview
  1.3 Purpose
  1.4 Thesis outline
  1.5 Document conventions

2 XML - Extensible Markup Language
  2.1 Background
    2.1.1 Advantages of XML
    2.1.2 Drawbacks with XML
    2.1.3 Data vs. document
  2.2 Validation
    2.2.1 DTD
    2.2.2 XML Schema
    2.2.3 Relax NG
  2.3 Namespaces
  2.4 Meta-data
    2.4.1 RDF
    2.4.2 OWL
  2.5 Links
  2.6 XML APIs
  2.7 Transforms
    2.7.1 XSLT
  2.8 Query
    2.8.1 XPath
    2.8.2 History
    2.8.3 Update capabilities
    2.8.4 XQuery
  2.9 Databases in XML
    2.9.1 Different types of XML databases
    2.9.2 Native XML databases
    2.9.3 Indices
    2.9.4 Normalization
    2.9.5 Referential integrity
    2.9.6 Performance
    2.9.7 Output/API
    2.9.8 NXD Models
    2.9.9 Implementations of native XML databases
  2.10 Summary

3 Bioinformatics
  3.1 Genes
  3.2 Proteins
  3.3 Pathways
  3.4 Experimental methods
    3.4.1 Two-hybrid systems
    3.4.2 Phage-display systems
    3.4.3 Curated data
  3.5 Databases
    3.5.1 KEGG
    3.5.2 DIP
    3.5.3 MINT
    3.5.4 BIND
    3.5.5 Reactome
    3.5.6 IntAct
  3.6 Proposed standard formats
    3.6.1 SBML
    3.6.2 PSI MI
    3.6.3 BioPAX
  3.7 Proprietary exchange formats
    3.7.1 KGML
    3.7.2 XIN
    3.7.3 BIND
  3.8 Summary

4 Problem analysis
  4.1 Questions to be answered
    4.1.1 Query capability
    4.1.2 Efficiency
  4.2 Chosen datasets
    4.2.1 Databases
    4.2.2 Queries
  4.3 Chosen technologies
    4.3.1 Native XML databases and XQuery
    4.3.2 The Graph Template Library

5 Native XML database setup
  5.1 Native XML databases
    5.1.1 Exist
    5.1.2 Sedna
    5.1.3 X-Hive
    5.1.4 Qizx/open
    5.1.5 Java
    5.1.6 Machine setup
  5.2 Queries
    5.2.1 Type of queries and efficiency
    5.2.2 Description of queries
    5.2.3 XML serialization
  5.3 Test framework
  5.4 Benchmarking

6 GTL test setup
  6.1 The GTL package
  6.2 Transformation
  6.3 The program
    6.3.1 Removal of extraneous edges
    6.3.2 Control of reachability and leaf deletion
    6.3.3 Path search
  6.4 Benchmark methods

7 Results
  7.1 Queries on IntAct data
  7.2 Queries on Reactome data
  7.3 Premature technique

8 Discussion
  8.1 Conclusions
  8.2 Future work
    8.2.1 More formats
    8.2.2 Data integration
    8.2.3 Data integration with OWL
    8.2.4 Using live remote data
    8.2.5 XQuery graph support
    8.2.6 User interface

Bibliography

A System specifications
  A.1 Software
  A.2 Hardware

B Figures

C File listings
  C.1 Transformation
  C.2 IntAct XQueries
  C.3 Reactome XQueries
Chapter 1
Introduction
This chapter presents a motivation for this thesis work, followed by a description of the problem to be solved and the actual objectives. Finally, a section is dedicated to the structure of this report.
1.1 Background
The research of protein-protein interactions is an area that, like others within molecular biology, produces vast quantities of data. There exist a number of different databases with information on protein interactions, but they often have incompatible data formats. This makes it harder to make use of all publicly available data.
The need for exchange of this data has resulted in at least three different proposals for standards: sbml, psi mi and Biopax; in addition, several exchange formats for specific databases have emerged. Many of these are based on xml (Extensible Markup Language).
1.2 Problem overview
Despite the fact that these formats solve the problem of extracting data from different databases, the problem of different data being available in different formats remains. To make use of the information contained in these files, integration of data from different sources and search functionality are needed.
A traditional method would be to import the data into a database for further processing. These operations can, however, be done directly on the xml files, thus eliminating the need for database import.
1.3 Purpose
The purpose of this thesis work is to investigate how search and discovery can be done directly on xml files containing information on protein-protein interactions. As a part of this work we want to:

- Test different proposed standard formats for protein-protein interactions.
- Evaluate different existing xml tools and their applicability to protein interactions.
- Build a basis for a working demonstration system.
1.4 Thesis outline
Chapters 2 and 3 contain a literature study of the areas of xml and bioinformatics respectively; both chapters have a summary at the end. In Chapter 4 the problem is analyzed, and Chapter 5 describes the testing system using native xml databases. In Chapter 6 the implementation of a specialized graph search program is discussed, followed by a presentation and evaluation of the results in Chapter 7. Chapter 8 summarizes the report and gives examples of future work.
The reader of this thesis is expected to have good knowledge of databases and the World Wide Web, and some knowledge of biology.
1.5 Document conventions
Text in proportional typeface (example) denotes data contents, function
names or operators.
Chapter 2
XML - Extensible Markup Language
This chapter introduces some background on the xml format as well as
information on xml databases and related technologies such as xml query
languages.
2.1 Background
xml, Extensible Markup Language [W3C98], is a document format devised
by the World Wide Web Consortium in the second half of the 1990’s. The
format is more general than html [W3C95] (Hypertext Markup Language)
and allows storage of all types of documents, not just hypertext documents
for the web, as well as other types of data storage. In xml, unlike html,
content is separated from presentation. Both xml and html are built on
sgml [ISO86] (Standard Generalized Markup Language), an international
standard for a metalanguage describing languages for the markup of electronic texts in a device-independent and system-independent way. Both
xml and html documents are conforming sgml documents. The reason
for designing xml as a subset of sgml may have been to gain acceptability within the community, but some [Kay03] think the main benefit of xml being just a small subset of sgml is the simplicity of the specification.
xml is used as a metalanguage: xml-based formats are specializations of xml, where just a few things are defined. Unlike with html, users are not given a fixed set of tags to use; instead they can define their own tags to describe the data. There exist a number of xml-based standards where an authority within the field has specified what tags to use and in what order.
By design, only a few things are specified for xml itself: enclosing of elements (all tags must be closed), how attributes are defined, how to refer to namespaces, and naming (what characters are allowed in a tag name). These rules are called a grammar, and documents following this grammar are said to be well-formed. An xml document following a specification (such as a dtd or an xml Schema) of allowed tag names and values is said to be valid.
An xml document contains a number of elements. These elements can
be nested and since a single root element is required, an xml document is
essentially a tree. Every element must be closed, which means it is easy to
see not just where an element starts but also where it ends. This means
that the document can be parsed without special knowledge of the tags.
Listing 2.1: xml example
<?xml version="1.0"?>
<document>
<toc />
<chapter numbering="no">
<section>Introduction to...</section>
</chapter>
<chapter>
<title>My 2nd chapter</title>
<section>
Continuing the text...
</section>
</chapter>
</document>
The first line of an xml document (as in Listing 2.1) is an xml declaration specifying the version of xml being used. document is the root node, and the level under it consists of one toc element and two chapter elements. Here the element toc has no content and is therefore opened and closed in the same tag by finishing off with a slash before the end bracket, >. Since xml can be used as a document format, the order of elements is significant.
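The parsing guarantee described above — that a generic parser can build the document tree without any knowledge of the tag set, and that a non-well-formed document is rejected outright — can be sketched with a short script. Python's standard-library ElementTree parser is used here purely for illustration; it is not one of the tools evaluated in this thesis.

```python
import xml.etree.ElementTree as ET

# The example document from Listing 2.1.
doc = """<?xml version="1.0"?>
<document>
  <toc />
  <chapter numbering="no">
    <section>Introduction to...</section>
  </chapter>
  <chapter>
    <title>My 2nd chapter</title>
    <section>Continuing the text...</section>
  </chapter>
</document>"""

# A generic parser builds the whole tree without knowing the tag set.
root = ET.fromstring(doc)
print(root.tag)                        # document
print([child.tag for child in root])   # ['toc', 'chapter', 'chapter']
print(root[1].get("numbering"))        # no

# A document with an unclosed element is not well-formed; parsing fails.
try:
    ET.fromstring("<document><toc></document>")
except ET.ParseError as err:
    print("not well-formed:", err)
```

Note that the parser needs no schema for this: well-formedness is a property of the grammar alone, while validity (against a dtd or schema) is a separate, stricter check.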
2.1.1 Advantages of XML
There are a number of advantages of xml. Most of them have to do with the structure, both logical and visual, which makes reading the files easier for humans and machines and allows extensions without breaking compatibility. The following advantages are described in [Hol01, MBK+00, CRZ03]:
Structure The requirement that every element must be closed means that
the document can be parsed without special knowledge of the tags.
The structure of data stored in xml format can be anywhere from
highly structured data to semi-structured data as well as unstructured
documents.
Machine readable xml was designed to be easy for computers to process. The strict limitations of the syntax make the file easy to parse without ambiguity. Using xml Schema information about data types and rdf meta-data (see Section 2.4.1), further processing of the information can be done automatically without human intervention.
Human readable Using sensible tag names for elements gives a self-describing markup, called auto-descriptive, making it possible to read and even write directly in xml format. This makes it possible to create documents for which no authoring or viewing applications exist yet. Introduction of too many attributes or namespaces may, however, make the xml file hard to read.
Extensible If an xml-based standard is missing tags for expressing some data, the document can be extended by introducing a new namespace. The new tags then use a namespace different from the default namespace used by the rest of the document. Old tools will still be able to read the files and just skip the information they do not understand.
External links Instead of having to include all information in the same document, other xml documents can be referenced by linking. The linked document can be a neighboring file or a document from a totally different organization.
Transforms Tools, especially xslt, for transforming xml data allow easy transformation of data from one format to another. Thus the world does not need to agree on one common schema; organizations can develop their own and transform to the schema the other party is using when communicating. Transforms are also an important reason why xml is being adopted in the publishing industry for information that is distributed in a number of mediums, e.g. news being published both on paper and on the web.
Standard W3C is an organization recognized for developing recommendations for the World Wide Web. Many of the information technology industry's larger companies are members of W3C. There is widespread support for xml, mainly because of xml's simplicity. Many standards using xml have been created, as well as tools for creating, viewing and processing xml files, some of them described later in this chapter.
2.1.2 Drawbacks with XML
The most obvious drawbacks with xml arise from its non-binary, verbose contents. While this makes the data easier to read, it leads to problems.
Verbose One main disadvantage with xml is its verbosity [Hol01]. The
requirement of start tags and end tags for delimiting data gives a large
overhead. This results in high memory demand and slow transmission
times. Therefore xml is not suitable for large data sets, such as
seismic data [Cok05] or data from computer tomography. Binary
extensions for xml are under development and will probably fix some
of these issues.
Messy With complex documents it can be hard to manually read and edit xml-based formats. One indication of this is the alternative compact syntaxes that have emerged, such as Notation 3 [BL01] for rdf data and the Relax ng Compact syntax [rel02].
Slow Mainly due to the verbosity, but also because of the lack of physical pointers [Bou04] within the file, xml files are slow to parse. By using external indices, xml parsers can speed up continuous parsing, as described in Section 2.9.3.
2.1.3 Data vs. document
The world of xml and databases has sometimes been divided into data-centric and document-centric parts by the xml community [Bou04]. Since xml documents are not strictly data- or document-centric, but somewhere in between, this is a skewed view.
Data-centric documents use xml purely as a data transport; they are
designed to be read by a computer. Data-centric documents have a fairly
regular structure, fine-grained data and little or no mixed content. The
order of sibling elements is almost never significant.
Most often data from xml files is stored in a traditional database and
imported and exported by third-party middle-ware or by functions built
into the database (xml-enabled).
Document-centric documents are often designed to be read by humans as well as computers. They have a less regular structure; the data is coarser grained and usually has mixed content. The order of sibling elements is often important. Document-centric documents are often handwritten in some format and converted to xml, or written in xml directly. The data does not originate from a database.
The distinction between data- and document-centric documents is not always clear. Many files, e.g. genetic and other biological data, are data-centric but semi-structured. Semi-structured data is irregular and can have a rapidly changing structure, a case where relational databases are not suitable. xml's capability of storing semi-structured data (as well as highly structured and unstructured data) is one of its strengths.
2.2 Validation
If an xml document conforms to a schema it is said to be valid. A schema
specifies what an xml document of the current sort should look like. There
are at least four different levels [vdV01] of validation that are implemented
to various degrees in schema languages:
- Validation of markup - the structure of the document.
- Validation of content in individual nodes (datatyping).
- Validation of integrity of links between nodes and between documents.
- Other tests ("business rules").

Of the four, support for link validation, especially between documents, is poor in most schema languages.
2.2.1 DTD
The xml version of dtd (Document Type Definition) is a simplified version of the dtd found in the sgml standard and thus has an xml-like but non-xml syntax. It does not support namespaces, and it has a weak datatype system that only works for attributes. dtd can check internal links within a document.
Listing 2.2: dtd example
<!ELEMENT document (toc,chapter+)>
<!ELEMENT toc EMPTY>
<!ELEMENT chapter (section|title)*>
<!ELEMENT section (#PCDATA)>
<!ELEMENT title (#PCDATA)>
<!ATTLIST chapter numbering (yes|no) "yes">
This dtd describes the structure of the xml document in Listing 2.1. A valid document according to the dtd can contain the root element document, which in turn consists of one toc element and one or more chapter elements. The toc element is empty, while the chapter element consists of several section or title elements. The chapter elements also have an attribute, numbering, with the default value "yes".
2.2.2 XML Schema
xml Schema [W3C01b] is a successor to dtd, developed by W3C, with support for namespaces, a richer datatyping system and extensible vocabularies. It is, however, complex: it has a large number of features and an xml syntax with a large number of attributes and heavy nesting, which makes it hard to learn and hard to use. xml Schema supports integrity checks of internal links in documents.
Listing 2.3: xml Schema example
<?xml version="1.0" encoding="UTF-8" ?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
<xs:element name="document">
<xs:complexType>
<xs:sequence>
<xs:element ref="toc" />
<xs:element ref="chapter" maxOccurs="unbounded" />
</xs:sequence>
</xs:complexType>
</xs:element>
<xs:element name="chapter">
<xs:complexType>
<xs:choice>
<xs:element ref="section" />
<xs:element ref="title" />
</xs:choice>
<xs:attribute name="numbering" type="xs:NMTOKEN" use="optional" default="yes"/>
</xs:complexType>
</xs:element>
<xs:element name="section">
<xs:complexType mixed="true" />
</xs:element>
<xs:element name="title">
<xs:complexType mixed="true" />
</xs:element>
<xs:element name="toc" type="xs:string" />
</xs:schema>
The xml Schema in Listing 2.3 describes the same document structure as the dtd in Listing 2.2. The difference, apart from the more verbose syntax, is that the xml Schema lacks a definition of allowed values for the numbering attribute.
2.2.3 Relax NG
Relax ng [Rel01] is a schema language based on Relax, a product of a Japanese iso technical report written by Murata Makoto, and trex by James Clark, who was technical lead for the xml 1.0 recommendation, co-author of the xsl and XPath recommendations, and the developer of a number of sgml and xml parsers. Relax ng is currently being developed into an iso standard.
The language was developed to be "simple, easy to learn, use xml syntax, ... support namespaces ... and can partner with a separate datatyping language" [Cla03]. It does not support integrity checks of links except by using features of an external datatype system, such as W3C's xml Schema.
Relax ng is easier to grasp than xml Schema, largely because of the separation of structure and datatypes.
There is also an alternative Relax ng syntax, the Compact syntax [rel02, Fit02], which is a non-xml syntax in some ways similar to the dtd syntax. A schema written in compact syntax can be translated to Relax ng xml syntax.
Listing 2.4: Relax ng example
<?xml version="1.0"?>
<grammar xmlns="http://relaxng.org/ns/structure/1.0">
<start>
<element name="document">
<optional>
<element name="toc">
</element>
</optional>
<oneOrMore>
<element name="chapter">
<attribute name="numbering">
<text />
</attribute>
<element name="title">
<text />
</element>
<zeroOrMore>
<element name="section">
<text />
</element>
</zeroOrMore>
</element>
</oneOrMore>
</element>
</start>
</grammar>
Listing 2.5: Relax ng Compact example
element document {
element toc { empty }?,
element chapter {attribute numbering { text }?,
element title { text }?,
element section { text } }+
}
Listing 2.4 shows that the Relax ng syntax resembles the original xml document. Another obvious difference to xml Schema is how the number of allowed elements is defined; it is possible to define this in Relax ng using the occurs attribute (like the xml Schema maxOccurs), but using the shorthand zeroOrMore is often easier. The compact syntax in Listing 2.5 somewhat resembles the dtd in Listing 2.2, but here nesting is used to determine which elements and attributes are allowed in other elements. Neither of the two Relax ng examples has default attribute values, because attribute defaults cannot be defined in Relax ng (except with a special extension using namespaces) [Smi01, Cla01].
2.3 Namespaces
A namespace [Sri][W3C99a] is used to avoid naming conflicts when several vocabularies are used simultaneously. Namespaces are used in xml technologies like xslt (to discriminate xslt's own tags from tags to be output in the resulting document). The namespace is a vocabulary of elements and attributes where an iri (Internationalized Resource Identifier, an internationalized version of uri) is used to identify the namespace.
A namespace can be declared anywhere in the document. If declared in the root node it applies to the entire document; if declared elsewhere it applies to the element where it is declared and to elements nested within that element. The declaration binds a prefix to the namespace, except for the default namespace declaration, which defines what namespace to use when a prefix is missing. The prefix is then used to connect namespaces with elements and attributes.
To define what elements and attributes there are in a namespace an
xml Schema Definition (xsd) can be written.
Listing 2.6: Namespace example
<?xml version="1.0"?>
<document xmlns="http://www.ida.liu.se/doc"
xmlns:pic="http://www.ida.liu.se/pic">
<toc />
<chapter numbering="no">
<section>Introduction to...
<pic:figure>
<pic:type format="postscript" version="3.0" />
<pic:input file="logo.ps" alt="IDA logo" />
</pic:figure>
</section>
</chapter>
<chapter xmlns:pic="http://www.dpg.se/pictures">
<title>My 2nd chapter</title>
<section>
Continuing the text...
<pic:figure file="diagram.svg">An interesting figure</pic:figure>
</section>
</chapter>
</document>
In Listing 2.6 the xml file is associated with the default namespace uri "http://www.ida.liu.se/doc". Note that nothing has to exist at that uri; it just needs to be unique so that implementations can determine what type of document the xml is. A prefix, pic, is associated with a second namespace, "http://www.ida.liu.se/pic". This namespace is then used to insert new elements in the first chapter. In the second chapter the pic prefix is overridden by another namespace, "http://www.dpg.se/pictures". This namespace will be used when the pic prefix is used in elements beneath this chapter element, since it is inherited.
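Most namespace-aware parsers resolve the prefix to the full namespace uri during parsing, so the two pic:figure elements in Listing 2.6 are seen as two different element names. The sketch below (again using Python's ElementTree purely for illustration) shows this on a reduced version of the listing; ElementTree reports each name in the form {namespace-uri}local-name:

```python
import xml.etree.ElementTree as ET

# Reduced version of Listing 2.6: the same pic prefix is bound to two
# different namespace URIs in the two chapters.
doc = """<document xmlns="http://www.ida.liu.se/doc"
          xmlns:pic="http://www.ida.liu.se/pic">
  <chapter>
    <pic:figure />
  </chapter>
  <chapter xmlns:pic="http://www.dpg.se/pictures">
    <pic:figure />
  </chapter>
</document>"""

root = ET.fromstring(doc)
# The prefix is only a local shorthand; parsed names carry the full URI.
figures = [el.tag for el in root.iter() if el.tag.endswith("}figure")]
print(figures)
# ['{http://www.ida.liu.se/pic}figure', '{http://www.dpg.se/pictures}figure']
```

The two elements are distinct even though both were written as pic:figure, which is exactly the conflict-avoidance that namespaces provide.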
2.4 Meta-data
Meta-data is data about data, in this context machine-understandable information. The data is used to identify relationships between different resources, e.g. who has written a certain article. This data can in turn be related to other data, e.g. the address of the author or the affiliation of the author.
2.4.1 RDF
rdf [W3C04e, W3C04d] (Resource Description Framework) is a way to express meta-data of web resources that can be used between different applications, and is one of the cornerstones of the semantic web. rdf
is developed by W3C and is often serialized as xml. rdf consists of a
syntax specification and a schema specification (rdfs). The description
itself has its own uri, making it possible to describe rdf descriptions in
rdf. The knowledge expressed is written as triples. A triple consists of a
resource (subject), a property (predicate) and an object. The resource (e.g.
the article) is a uri (Uniform Resource Identifier) reference, thus rdf can
be used to express the semantics of anything that has a uri. A property
(e.g. author) is also a resource making it possible for the property to have
properties itself. The object can be a value (e.g. the name of the author)
or a resource (e.g. the home-page of the author).
rdf descriptions can be embedded directly in an xml document or
provided separately.
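As an illustration of the triple model, the statement "the article at a given uri has the creator David" could be serialized in rdf/xml roughly as follows. The article uri is invented for the example; dc:creator is the Dublin Core authorship property:

```xml
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:dc="http://purl.org/dc/elements/1.1/">
  <!-- subject: the resource being described, identified by its uri -->
  <rdf:Description rdf:about="http://www.ida.liu.se/doc/article1">
    <!-- predicate: dc:creator; object: a literal value -->
    <dc:creator>David</dc:creator>
  </rdf:Description>
</rdf:RDF>
```

Since dc:creator is itself identified by a uri, further triples could describe the property or point to a resource (e.g. a home-page) instead of a literal.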
2.4.2 OWL
owl [W3C04c] (Web Ontology Language) builds upon rdf and adds relations between classes, cardinality, equality, symmetry and so on. owl is
constructed to build ontologies of different subjects with the web in mind.
Like rdf, owl was developed by W3C. There are three types of owl documents: Lite, dl (Description Logic) and Full. All rdf documents are also
valid owl Full documents. owl dl is a more restricted subset of Full, and
Lite is in turn a more restricted subset of dl, making the language easier to grasp
and easier to provide tools for. For example, Lite only allows cardinalities
of 0 or 1.
2.5 Links
Links are the main functionality of the web; they make it possible to connect
one resource to another. Links are useful not just in hypertext but also for
connecting other types of information. This can be used to reduce the
amount of redundant data by linking to common information resources, or to
link to resources that have access to up-to-date data.
XLink [W3C01a] provides a way to link between all types of xml documents. Unlike links in an html document, which only go from the
current document, XLink allows links both to and from a document.
2.6 XML APIs
Since parsing of xml is an essential part of all applications that handle xml
data, the need for xml apis is inevitable. An api (Application Programming
Interface) is used to access data in xml files from within a programming
language. There are a number of types of xml apis, as identified by Elliotte
Rusty Harold [Ven03]. He identifies five different types, or styles, of xml
apis:
Push API, or event-based api, was the first type of xml api invented,
because it was easy to implement. Push apis are streaming apis where the
document is read in document order and an event is triggered when
an element starts or ends and when character data starts. The programmer
has to implement handlers for these events that make use of the data.
A push api has low memory requirements and is very fast, but is on the other
hand complex to use, especially if parent-child element relations
are to be accessed. This is because you cannot access a specific node - the document is read in document order only. If you want to access
a previous sibling or parent, the document must be reread from the start.
The most famous push api is sax [SAX] (Simple api for xml). It
began as an xml api for Java and is now a de facto standard for push-based xml parsing. Another push api is Xerces Native Interface.
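The callback style can be sketched with Python's built-in sax module; the handler class and the small document below are invented for the example:

```python
import xml.sax

class SectionCounter(xml.sax.ContentHandler):
    """Counts section elements; the parser calls back (pushes) events."""
    def __init__(self):
        super().__init__()
        self.sections = 0

    def startElement(self, name, attrs):
        # The parser pushes an event for every start tag, in document order.
        if name == "section":
            self.sections += 1

handler = SectionCounter()
xml.sax.parseString(
    b"<document><chapter><section/><section/></chapter></document>",
    handler)
print(handler.sections)  # 2
```

Note that the handler only ever sees the current event; reaching back to an earlier sibling would require buffering it yourself.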
Pull API is the newest type of xml api and is, like the push api, a streaming
api. The difference from the push api is that the application asks, or pulls,
the parser for new information instead of the parser calling functions
when new information is found. The pull api is also very fast and
very memory efficient, but is simpler to use than the push api. Being
the newest type of api, the implementations have not yet matured much.
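The pull style can be sketched with Python's xml.dom.pulldom module; the document is invented for the example:

```python
from xml.dom import pulldom

# The application iterates (pulls) events from the parser,
# instead of the parser calling back into the application.
stream = pulldom.parseString("<document><chapter/><chapter/></document>")
chapters = 0
for event, node in stream:
    if event == pulldom.START_ELEMENT and node.tagName == "chapter":
        chapters += 1
print(chapters)  # 2
```

The control flow stays in the application, which makes pull code read like ordinary loops rather than scattered event handlers.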
Tree-based API builds an object model of the xml document, often a tree with nodes for elements, attributes, text and so on. This tree
can then be queried, navigated and modified via the api. It is an effective way to process the contents, but since the tree usually must be
kept in internal memory it is impossible to use for larger documents.
Some applications, however, such as some xml databases, have tree-based apis that do not require all of the tree to be in internal memory. The best-known tree-based apis are W3C's recommendation dom
[W3C04a] (Document Object Model) and jdom [HM04], a Java-based
document object model.
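A tree-based sketch using Python's minidom implementation of dom; the document is invented for the example:

```python
from xml.dom import minidom

# The whole document is parsed into an in-memory tree up front.
dom = minidom.parseString(
    "<document><chapter><title>My 2nd chapter</title></chapter></document>")

# Unlike a streaming api, the tree can be navigated in any order,
# for example from a node back up to its parent.
title = dom.getElementsByTagName("title")[0]
print(title.firstChild.data)     # My 2nd chapter
print(title.parentNode.tagName)  # chapter
```

This random-access navigation is exactly what the streaming apis above cannot offer without rereading the document.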
Data-binding API, like the tree-based api, parses the entire document and
builds an object model. But here the model represents actual data
instead of elements, attributes and so on in the xml document.
A class in a data-binding api can be a chapter class (compare
with Listing 2.1) corresponding in some way to some element with
sub-elements and attributes in the xml document. The mapping
between the xml document and the resulting objects is defined in
some sort of schema: W3C's xml Schema language, dtd or a special
binding schema language. The problem with data-binding apis is
that most xml documents do not have a corresponding xml Schema,
have a schema in a format other than W3C's Schema language (which
most apis assume) or do not follow the specified schema.
Another problem is that data-binding apis assume flat data
structures and that the order of elements does not matter, rendering them
unsuitable for all document-centric xml and a large amount of data-centric xml.
Query API is an api where queries, in a language like XPath, xslt or XQuery, are made
directly over the api. Most of the work to get the data is put into writing the
queries in the query language itself, not in the native programming
language.
The two most common styles of xml apis are event-based (push api) and
tree-based.
In general, the existing apis are [Ven03] too complicated or too simple.
They do not model xml completely or correctly. The largest problem is
with namespaces (see Section 2.3), an area within xml that is difficult to
understand and difficult to handle in the apis.
2.7 Transforms
Transformation is the process of transforming one xml document into another. These transforms are often from a device-independent data format
to a specific presentation format, such as xhtml, html or wml. Data can
be filtered and grouped in a transformation, to give just the desired subset
of information.
2.7.1 XSLT
xslt [W3C99c] (xsl transformations) is the transformation language in
the Extensible Style-sheet Language family [W3C04b] (xsl). It is, like most
W3C recommendations, xml-based. XPath (as described in Section 2.8.1)
is an essential part of xslt; it is used to specify which nodes to transform.
xslt supports templates, sorting, grouping and user-defined functions (the
latter two were introduced in xslt 2.0). xslt is designed for transformation
to presentation formats; though other types of transformations are possible, the
functionality is aimed at presentation transformation and may lack some
needed features.
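A minimal sketch of what an xslt style-sheet could look like for the chapter documents used in this chapter's listings; the mapping to html elements is invented for the example:

```xml
<xsl:stylesheet version="1.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <!-- A template matches the nodes selected by an XPath pattern. -->
  <xsl:template match="chapter">
    <h1><xsl:value-of select="title"/></h1>
    <xsl:apply-templates select="section"/>
  </xsl:template>
  <xsl:template match="section">
    <p><xsl:apply-templates/></p>
  </xsl:template>
</xsl:stylesheet>
```

Applied to a document like Listing 2.1, each chapter would be rendered as an html heading followed by its sections as paragraphs.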
2.8 Query
Many xml tools, such as xslt and query apis, use some type of query
language. XQuery is becoming more common, and all xml query tools will
probably support XQuery eventually.
2.8.1 XPath
XPath [W3C99b] is for xml nodes what regular expressions (wildcard search patterns used
for searching in text) are for text. It is standardized by W3C and is
used to find nodes in xsl and XQuery, among others.
An XPath expression defines the path top-to-bottom through the tree
to the desired nodes. Each node level is separated by a slash, much like
paths used in file systems. To find all the section elements in Listing 2.1
the following XPath expression can be used:
/document/chapter/section
If you want to find all section tags irrespective of where they are in the
tree the expression is:
//section
The double slash (//) means arbitrary depth. In this case the result
would be the same.
XPath also supports function calls and conditionals.
count(//chapter[@numbering="no"])
would return the number of chapter elements whose numbering attribute is
set to “no”. The @ sign indicates that the following name is an attribute.
There is no not-equal operator; this is instead achieved by enclosing the
expression in a function call to not():
count(//chapter[not(@numbering="no")])
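The path and predicate expressions above can be tried out from a programming language. As a sketch, Python's ElementTree module supports a limited XPath subset (enough for paths and attribute predicates, though not for functions such as count() or not()); the document is invented for the example:

```python
import xml.etree.ElementTree as ET

root = ET.fromstring("""
<document>
  <chapter numbering="no"><section>Intro</section></chapter>
  <chapter><title>My 2nd chapter</title><section>More</section></chapter>
</document>""")

# //section : all section elements at any depth
sections = root.findall(".//section")
# //chapter[@numbering="no"] : predicate on an attribute value
unnumbered = root.findall(".//chapter[@numbering='no']")
print(len(sections), len(unnumbered))  # 2 1
```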
XPath can work as a basis for a query language, but quite a few features
are missing, like grouping, sorting, cross-document joins and data types.
If used within an xsl transform (xslt), the first three of these features
are provided by xslt. xslt, with its xml-based syntax, is somewhat hard
to use and probably nothing you would want to use from a programming
environment.
2.8.2 History
A number of languages for querying xml files have been developed; some of them, such as oql, Quilt, YaTL, xml-ql, xql and Lorel, originated in research on semi-structured data. A W3C working group is developing an xml query language called XQuery, based on Quilt and inspired
by some of the mentioned languages.
2.8.3 Update capabilities
With large amounts of data in xml documents it becomes interesting to be
able to update the data. There are several ways of updating xml documents:
from simply replacing the existing document, to updates on a live dom
tree, to a special update language. Most methods are proprietary, but two
common languages have emerged:
• XUpdate [LM00] from the xml:db initiative is based on XPath, which it
uses to specify which nodes to delete/update or which nodes to insert
new nodes before or after.
• XQuery extensions have been proposed by W3C's XQuery Working
Group and by Patrick Lehti [Leh01]. Some variations of these are implemented in a number of xml tools.
When update/delete support is added to XQuery it is likely to be supported
by a large number of tools handling xml data.
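A sketch of what an XUpdate modification might look like, run against the document in Listing 2.1. The element names follow the XUpdate working draft; the new title text is invented for the example:

```xml
<xupdate:modifications version="1.0"
    xmlns:xupdate="http://www.xmldb.org/xupdate">
  <!-- Replace the text of the second chapter's title... -->
  <xupdate:update select="/document/chapter[2]/title">
    My second chapter
  </xupdate:update>
  <!-- ...and remove the table of contents. -->
  <xupdate:remove select="/document/toc"/>
</xupdate:modifications>
```

The select attributes are plain XPath expressions, which is what ties XUpdate to the querying machinery described above.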
2.8.4 XQuery
XQuery [W3C04f] is designed to query collections of xml data and has
been called the “last great project of xml standardization” [Dum04]. The
collections can be data from xml files, xml databases or relational databases, at varying levels of structure. Queries can combine data from
several sources, and since element order is important in some xml documents, results can be given in element order. XQuery builds upon XPath,
which is used heavily throughout the language. The specifications for XQuery 1.0 and
XPath 2.0 are developed by the same working group under W3C, and the
final XQuery 1.0 recommendation is believed to be released at the end of
2005 [Dum04]. XPath is nowadays defined as a subset of XQuery. The
non-XPath additions are sql-like in function and syntax.
XQuery was first called Quilt and was a successor to the xml query
language xql; other influences were xml-ql and sql.
XQuery consists of three different languages: one human-friendly non-xml syntax, one machine-friendly xml syntax (XQueryX, see Section 2.8.4)
and a formal algebraic language used in the XQuery processor.
XQuery currently does not support group-by operations and thus nested
queries (queries with sub-queries) are important. There are no restrictions
on query nesting in XQuery.
The main concept in XQuery is the flwor (pronounced “flower”)
expression. flwor is an acronym for the expressions for, let,
where, order by and return. For is used to iterate over nodes; let
is used to assign a variable. Both these expressions specify a sequence of
tuples that can be filtered or ordered using the where and order by clauses
respectively. Once the tuples have been filtered and ordered they are returned by the return clause.
flwor can be used to build simple or complex xml fragments and
several documents can be queried simultaneously. The flwor expressions
are separated from the surrounding xml content by curly braces.
flwor has a type of conditional expression that, unlike in most languages,
requires an else clause, because every expression in XQuery must return a
value.
Listing 2.7: XQuery example
<automated-toc title-chs="{count(//chapter/title)}"
tot-chs="{count(//chapter)}">
<chapter> {//chapter/title} </chapter>
</automated-toc>
Listing 2.8: Results of XQuery in Listing 2.7
<automated-toc title-chs="1" tot-chs="2">
<chapter>
<title>My 2nd chapter</title>
</chapter>
</automated-toc>
Listing 2.7 shows a simple XPath-type query (with results in Listing
2.8) counting the number of chapters in Listing 2.1 that have a title, as well as the number of
chapters in total. The title of the chapter is also printed.
Listing 2.9: Second xml file
<?xml version="1.0" encoding="UTF-8"?>
<annotations>
  <annotation>
    <fortitle>My 2nd chapter</fortitle>
    <reviewer>
      <name>David</name>
      <email>[email protected]</email>
    </reviewer>
    <date>2004-10-16</date>
    <text>There needs to be a lot more text.</text>
  </annotation>
  <annotation>
    <fortitle>My 2nd chapter</fortitle>
    <reviewer>
      <name>Cecilia</name>
      <email>[email protected]</email>
    </reviewer>
    <date>2004-09-23</date>
    <text>Change title to "My second chapter"!</text>
  </annotation>
  <annotation>
    <fortitle>Another chapter</fortitle>
    <reviewer>
      <name>Cecilia</name>
      <email>[email protected]</email>
    </reviewer>
    <date>2004-09-23</date>
    <text>This chapter should perhaps be removed.</text>
  </annotation>
</annotations>
Listing 2.10: XQuery join
<doc-annotations>
{
  for $b in doc("exempel.xml")//chapter
  return
    <part>
      <title>{$b/title}</title>
      {
        for $c in doc("annotation.xml")//annotation
        where $c/fortitle = $b/title
        order by $c/date
        return
          <comment from="{$c/reviewer/name}">{$c/text}</comment>
      }
    </part>
}
</doc-annotations>
Listing 2.11: Result of XQuery join
<doc-annotations>
  <part>
    <title/>
  </part>
  <part>
    <title>
      <title>My 2nd chapter</title>
    </title>
    <comment from="Cecilia">
      <text>Change title to "My second chapter"!</text>
    </comment>
    <comment from="David">
      <text>There needs to be a lot more text.</text>
    </comment>
  </part>
</doc-annotations>
Listing 2.10 shows a more advanced join query, using flwor, on the
documents in Listings 2.1 and 2.9. The chapters in the first xml document
are listed together with the corresponding annotations from the second file.
The result is shown in Listing 2.11.
XQuery has a large number of functions, including functions for math,
string manipulation, time comparison, node manipulation, sequence manipulation, type conversion and Boolean logic. All functions are placed
in a namespace to avoid name collisions. You can also define your own
functions. There is no support for function overloading. Functions can be
recursive.
Since there is no way to import code, it is not possible to maintain a common
function library for frequently recurring functions.
There are a number of arithmetic and comparison operators. Special
operators are << and >>, which check whether a node appears before or after
another node.
XQueryX [W3C03], formerly abql (Angle Bracket Query Language),
is an xml representation of XQuery and is meant to be used by existing
xml tools for parsing, creation or modification. It does not seem to be
used to any great extent in the existing implementations, and there is
speculation whether it will be included in the XQuery recommendation or
not.
Some of the most well-known stand-alone XQuery implementations are:
XQEngine [Kat04] is an open-source Java component for XQueries.
QuiP [AG] from Software ag is a prototype XQuery engine that allows
queries on xml files in a file system as well as on xml stored in a
Tamino xml Server (see Section 2.9.9). QuiP has not been updated
for the latest recommendation drafts.
XQuark [Gro04] is an open-source project that has released two different
products: Bridge and Fusion. Bridge is a system for extending relational database systems with xml import and export capabilities. Fusion
is a system to get an xml view of data sources (including those from
Fusion and Bridge) and perform XQueries on it.
IPSI-XQ [Dar04] is a demonstration implementation of XQuery written at
Fraunhofer ipsi. It has graphical, command-line and web interfaces
and a Java api. It has been updated up to the November 2003 draft
of the XQuery recommendation.
Qexo [Bot03] is an open-source, partial implementation of XQuery using
the Kawa framework to compile queries to Java bytecode for better performance. Many of the standard functions defined in XQuery
are not supported.
Galax [gal04] is an open-source implementation of XQuery developed by
AT&T, Lucent and Avaya in the O'Caml programming language.
Command-line tools, a web interface and apis for O'Caml, C and
Java are available.
Qizx/open is an open-source XQuery implementation. It implements all
features except schema import and validation. A commercial variant
called XQuest with support for indexing is under development.
XQJ is an effort to specify an XQuery Java api based on jdbc. See Section
2.9.7.
Saxon is a Java-based xslt processor that now also supports XQuery.
2.9 Databases in XML
xml has some advantages as a database format: it is self-describing (structure and type names are given in the markup, though the semantics is missing), it is portable, and data can be described in a tree or graph structure.
The main disadvantages are its verbosity and the need for parsing and text
conversion, leading to slow reading of data.
xml and related tools provide many of the features found in databases:
storage, schemas, query languages and programming interfaces. However, xml
lacks efficient storage, indices, security, transactions, data integrity, multi-user access, triggers, queries across multiple documents etc.
These shortcomings can be overcome by storing the xml in an xml
database.
In case the data has an irregular structure and/or uses entities, a native
xml database (nxd) can be suitable. Physical document structure is preserved in an nxd, document-level transactions are supported, and queries
can be executed on the xml.
2.9.1 Different types of XML databases
Storing database-like xml documents in ordinary xml files may work for
small amounts of data, but will fail in most production environments, which require strict data integrity and good performance.
One main issue is indexing: finding a particular element or value inside
an xml file requires reading the whole file if an index is missing.
A better solution is to store the xml data in a database. There are two
different types of xml databases: enabled and native. A third hybrid type
also exists.
XML-enabled databases (xedb) are built on top of relational databases. xml data is mapped to a relational database, giving access
to all the features and the performance found in relational database
management systems. Due to xml's structure this mapping often
gives a large number of tables or an unnormalized representation, except in
cases where the data is well-structured, not nested in too many levels
and has a strict schema. The mapping also often leads to the physical and logical structure (processing instructions, comments, element
order and so on) of the xml data being lost to some degree. This
means that an xml document exported from the database will probably
be different from what was imported. Most of the well-known
database management systems offer xml-enabled databases, such as
ibm's db2, Oracle's Oracle 10g and Microsoft's sql Server.
Native XML databases (nxd) are designed especially to store xml documents. They support transactions, security, multi-user access, programmatic apis, query languages and so on. The only difference from
other databases is the internal model. nxds are most useful for storing document-centric documents, since document order, processing
instructions, comments, cdata sections and entities are preserved.
Not just documents but also data, especially data semi-structured in
nature, can be stored in native xml databases.
Hybrid XML databases (hxd) are databases that, depending on the
features wanted, can be seen either as native xml databases or as
xml-enabled databases. An example is Ozone, which allows data access through xml, whose internal data representation maintains full
xml structure and whose fundamental unit of storage is an xml document.
The boundaries between traditional xml-enabled databases and nxds have
blurred. Traditional databases support native xml operations, and nxds
can use external relational databases to store document fragments. Standard xml tools can be used for working on the stored documents, including dom, sax, XPath, xslt and XQuery. nxds are useful for storing
documents where xml is the native format. An nxd requires no initial
configuration to store an xml document.
xml databases are a younger research field than relational databases, and
much work on finding effective algorithms to query tree-structured data
remains to be done.
2.9.2 Native XML databases
The term native xml database was first mentioned in the marketing campaign for Software ag's product Tamino. The term is in common use but
lacks a formal technical definition.
Software ag's own definition [CRZ03] requires that a native xml database system is built and designed specifically for the handling of xml.
Another, more open definition, developed by members of the xml:db
mailing list [Bou04], is:
• Defines a (logical) model for an xml document – as opposed
to the data in that document – and stores and retrieves documents according to that model. At a minimum, the model
must include elements, attributes, pcdata, and document order. Examples of such models are the XPath data model, the
xml Infoset, and the models implied by the dom and the events
in sax 1.0.
• Has an xml document as its fundamental unit of (logical) storage, just as a relational database has a row in a table as its
fundamental unit of (logical) storage.
• Is not required to have any particular underlying physical storage model. For example, it can be built on a relational, hierarchical, or object-oriented database, or use a proprietary storage
format such as indexed, compressed files.
An nxd can store more information than is contained in the model. A document is the fundamental unit of storage in all nxds, although it would be
possible to use document fragments instead.
An nxd provides robust storage and a way to manipulate xml documents.
The model of nxds allows arbitrary levels of nesting and complexity. The
meaning of stored data is given by the retrieved document, not by what
is stored in the underlying layer. There is currently no well-defined way
to perform updates in the stored documents. A few proprietary update
languages exist, as well as xml:db's XUpdate, but there is no widespread
language, and there probably will not be one until an update language is
added to XQuery; see Section 2.8.3 for more discussion on this. nxds excel
at storing document-oriented data, data with very complex structure and
deep nesting, and data that is semi-structured in nature.
The reasons for storing xml in nxds are:
• When you have semi-structured data (a regular structure but with
large variations), a large number of null-valued columns or a large
number of tables would be required to store the data in a relational
database. This type of data could also be stored in an object-oriented
or hierarchical database.
• Faster retrieval speed. Some nxds store entire documents together
physically on disk or use physical pointers, allowing documents to
be retrieved faster than from a relational database where data is spread
logically as well as physically. If data is retrieved in non-sequential
order there will probably not be any performance boost.
• If you want to use xml-specific capabilities, such as executing xml
queries. Support for xml queries is not yet widespread, and xml
query languages are being implemented in relational databases, so this reason is not very strong.
Most native xml databases can only return data as xml; if the application
needs data in another format it must parse the xml. In a distributed
application (e.g. web services) this is not a problem, since the data has to be
serialized in some form for distribution, xml being the most common.
xml does not support data types directly; this information can, however, be specified via a schema, such as xml Schema. Some nxds require
collections to be associated with a schema while others do not.
Some nxds can include remote data in documents. The data is retrieved
from a relational database using a mapping.
An nxd operates on a set of documents, called a collection. An entire
collection can be accessed as one tree. The collection can be associated
with a schema specifying the format of the stored data, but not all xml
databases require this. Not requiring a schema gives
higher flexibility and makes development easier, but also comes with the risk
of low data integrity. Some products support validation with dtd and a
few also with xml Schema.
Transactions are supported in many nxds. Locking is often performed
at document root level, because a deletion of a node higher up in the hierarchy would otherwise remove the node being updated at the same time. There are
ways to achieve node-level locking, but this is somewhat complex and not
yet implemented in any of the nxds investigated.
One of the advantages of nxds is their round-tripping support, i.e.
they can emit xml documents exactly as they were once entered into the database.
2.9.3 Indices
Indices are found in all nxds. There are three types:
• Value indices. These include text and attribute values.
• Structural indices. Indices containing information on the location of elements and attributes.
• Full-text indices. The full-text indices are based on text and attribute
values.
Most nxds support the first two types; a few support full-text indexing.
These types can be combined, e.g. a structural value index can be used to
find specific elements containing a specific value. Some nxds make
use of xml Schema information about data types when indexing data.
2.9.4 Normalization
Normalization is, just as with relational databases, a concern with xml
databases. The risk of duplicated, redundant data is present also for xml
[Bou04, Pro02a], resulting in increased file size and a risk of inconsistency.
Just as with relational databases, you are not required to normalize your
data.
Since the structure of an xml document differs greatly from the flat tables
of relational databases, normalization is quite different. xml's support for
multi-valued fields (one of a number of the xml structure's deviations from
first normal form) makes it possible to normalize data in a way not possible
with a relational database, since multiple children in a one-to-many relationship can be placed directly under a parent using composition [Pro02a],
without the need for multiple tables or foreign keys. If the data is spread over a
number of documents, however, keys, implemented with e.g. XLink, are needed
[Pro02a, Bou04].
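As a sketch (the book and author data are invented for the example), a one-to-many relationship that would require a separate table and foreign keys in a relational database can be expressed directly through composition:

```xml
<book id="b1">
  <title>An XML Primer</title>
  <!-- Multiple children placed directly under the parent:
       no join table or foreign keys needed. -->
  <author><name>Ann</name></author>
  <author><name>Bo</name></author>
</book>
```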
Second and third normal forms can be applied nearly unchanged to
the xml world, using xml Schema's keyref capabilities to act much like a
foreign key. This should only be used for many-to-one and many-to-many
relationships; for one-to-many relationships composition is a better use of
xml's structure [Pro02b]. In a relational database a primary key must be
unique within its database instance. In xml Schema a key is unique within
the scope of some element instance; thus the scope depends on in which
element the keyref is defined.
Since fourth normal form is only interesting if first normal form is followed, it has no meaning in an xml tree. Fifth normal form does apply to xml,
however [Pro02b].
2.9.5 Referential integrity
Referential integrity means ensuring that pointers in xml documents point
to valid documents or document fragments. These pointers can be
id/idref attributes, key/keyref fields (as in xml Schema), XLink links or
some proprietary mechanism. Referential integrity in nxds can apply either
to internal pointers or to external pointers. Most nxds only validate internal pointers at insertion; few, if any, guarantee referential integrity
after updates. External pointers are not checked for integrity: it is only
meaningful to enforce integrity for pointers within the same database, and
since the database has no way to control external documents, such pointers
cannot be enforced. Referential integrity will probably come to be supported for internal pointers and perhaps also for "external pointers of some sort" [Bou04].
Until then it is up to applications to enforce the integrity of pointers.
2.9.6 Performance
If data is retrieved in stored order, native xml databases should scale as well as,
and probably even better than, relational databases. This is confirmed by tests
[CRZ03]. If data is retrieved in any other order it will most certainly be slow
and scale poorly, despite the heavy use of indexing in nxds, which
helps retrieval time at the cost of slower updates. For un-indexed data
nxds are clearly outperformed by relational databases.
Further work on specialized algorithms for querying and processing xml
databases, such as those developed in the Timber project [JAKC+02], will enable
the use of large databases.
2.9.7 Output/API
apis are offered by most nxds. These are often odbc-like and return xml
strings, dom trees, or a parser over the returned document. Most apis are
proprietary, but there exist two well-known vendor-neutral apis:
• The xml:db api [Sta01] uses XPath as a query language.
• xqj (XQuery api for Java) [EMA+04] is based on jdbc and is under
development.
2.9.8 NXD Models
Native xml databases can be divided [Bou04] into two large groups depending on how they store data: text-based and model-based.
Text-based NXDs can use ordinary files, blobs in relational databases
or a proprietary text format as storage. Indices are essential, since
they give a tremendous speed advantage when retrieving documents:
a single index lookup gives the position of the required
document fragment, and the bytes are read in the order they are stored. If
you want the data in another structure, a text-based nxd is likely to be outperformed
by a relational database.
Model-based NXDs build an internal object model from the document
and store this model. It can be stored in a relational or object-oriented database, or in a proprietary storage format optimized for the model. Model-based nxds that use a proprietary storage format will have performance similar to that of text-based nxds
when retrieving data in storage order, since physical pointers are used.
Model-based nxds are faster than text-based nxds at retrieving documents as Document
Object Model (dom, see Section 2.6) trees. As with
text-based nxds, performance problems can occur when data is retrieved in another order.
Persistent DOM is a specialized type of model-based nxd which uses a
dom internally. This makes live dom access possible. Due to the
large memory requirements of dom, this is not practical for large
documents.
2.9.9 Implementations of native XML databases
There exist a number of nxd implementations. This list contains the most
popular databases as well as a number of databases with interesting features.
dbXML [dbx03] is an open-source native xml database. It does not support XQuery.
Exist (eXist) [exi05] is an open-source Java-based native xml database
that supports XQuery.
Infonyte DB [inf05] is a native xml database based on persistent dom.
For Java. Does not support XQuery yet.
Ipedo XML database [ipe] supports XQuery and transactions. For Web
services, Java and .NET.
MarkLogic Content Interaction Server [mar05] supports XQuery, lock-free queries and configurable levels of document fidelity.
Neocore XMS [neo05] is a transactional native xml database from Xpriori, using a “patented pattern-recognition technology” [LLC05]. It
supports XQuery and has apis for Java, C#, C++, Web services etc.
For Linux and Microsoft Windows.
Sedna [sed05] is an nxd developed at the Institute for System Programming of the Russian Academy of Sciences. It supports XQuery and
has an update language. Scheme, C and Java apis are
available. Sedna is implemented in C/C++ and Scheme. Binaries
are available for Microsoft Windows; it is, however, said to be possible
to compile it on other operating systems such as Linux.
Sonic XML Server [son05] from Sonic Software supports XQuery, document linking and update-grams (a way to update xml documents) with
triggers. For Java.
Tamino [tam04] is a commercial nxd from Software ag and the first product announced as a native xml database. Whether it really has native xml storage is unclear, since details of the architecture have not been published. Available on Linux and Microsoft Windows platforms.
Timber [tim04], developed at the University of Michigan, uses a special
type of algebra that takes trees as input and gives trees as output to
achieve fast querying. Only Microsoft Windows is supported. Timber
does not support user-defined XQuery functions.
Virtuoso [vir04] from OpenLink gives access to a virtual database via
xml, odbc, jdbc, .net or ole db. XQuery is supported. Virtuoso
can be used with Web services, .net, Mono and j2ee.
Xindice [xin] is a fork of dbXML developed under the Apache umbrella.
It does not have XQuery support yet.
X-Hive DB [xhi05] is a commercial nxd that supports XQuery and versioning. It can be used with j2ee.
2.10
Summary
xml is a document format that, due to its flexibility, is used in more and
more areas. To ensure that documents conform to a standardized schema, a
number of validation techniques are available. Documents can be extended
to contain information not defined in the schema.
As with data stored in traditional database systems, issues with normalization and performance occur. Since the structure of xml
documents differs from that of relational databases, these areas differ remarkably
and are the focus of much research.
XQuery is a query language for querying xml data as well as data from
relational database systems. xml data is also returned as a result of the
queries. XQuery is currently under development by a W3C working group
and is already supported, to different extents, by a number of tools.
Effective and fast apis are necessary to handle large xml documents.
When managing many and/or large xml files, it becomes cumbersome to
keep track of all the files, and queries on their contents are slow. With the help
of xml databases xml documents can be assembled into collections and
indexed. Native xml databases (nxds) are a type of database management
systems with a special architecture to store and retrieve xml data. nxds
are better than other xml databases, such as xml extensions on relational
database management systems (xedb), when data is semi-structured or
you want to retrieve documents as they were stored.
Chapter 3
Bioinformatics
The advances in molecular biology have resulted in the sequencing of the genomes
of several species, and thus in amounts of data impossible to process manually. By introducing information science into biology, creating a field called bioinformatics, researchers are able to store and access these data easily. This chapter
describes the fundamentals of the central dogma of biology, protein interactions, existing databases and exchange formats.
3.1
Genes
A gene is a description of what amino acids to assemble to build a protein.
Genes are part of dna molecules, large chains of nucleotides. There are
four different types of nucleotides: adenine (a), guanine (g), cytosine (c)
and thymine (t). A combination of three nucleotides, a codon, codes for
one amino acid. Some codons do not code for amino acids but indicate the
beginning or ending of a protein being coded.
To create proteins, the part of interest of the dna molecule is copied
to an mrna molecule in a process called transcription (the m stands for
messenger). To this mrna other molecules, called trna, bond. There are
47 different trna molecules, each bonding to a specific triplet on one side
and a specific amino acid on the other.
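The codon-to-amino-acid mapping described above can be pictured as a simple lookup table. The sketch below is our own illustration, not code from the thesis: it covers only a handful of the 64 possible codons, uses the mrna alphabet (where uracil, u, replaces thymine), and stops translating at a stop codon.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Illustrative sketch: translate an mrna sequence codon by codon.
// Only a handful of the 64 codons are listed; "Stop" marks codons
// that end translation rather than coding for an amino acid.
const std::map<std::string, std::string> kCodonTable = {
    {"AUG", "Met"},   // also the usual start codon
    {"UUU", "Phe"}, {"GGC", "Gly"}, {"AAA", "Lys"},
    {"UAA", "Stop"}, {"UAG", "Stop"}, {"UGA", "Stop"},
};

std::vector<std::string> translate(const std::string& mrna) {
    std::vector<std::string> protein;
    for (std::size_t i = 0; i + 3 <= mrna.size(); i += 3) {
        auto it = kCodonTable.find(mrna.substr(i, 3));
        if (it == kCodonTable.end() || it->second == "Stop") break;
        protein.push_back(it->second);   // one codon codes one amino acid
    }
    return protein;
}
```

In a real cell this lookup is performed physically by the trna molecules; the table above merely mimics the mapping.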
There are about 3.08 billion base pairs in the human genome [Con04].
About 10 percent of the base pairs code for genes; what function the remaining 90
percent has is unclear. The human genome contains 20,000-25,000 genes
[Con04].
3.2
Proteins
Proteins are long polymers of amino acids. Most proteins consist of somewhere between one hundred and one thousand amino acids. Since there
exist twenty different amino acids, the number of possible protein
structures is enormous, but in reality about 15,000 different protein structures
are currently known [Les02].
Proteins are important for the life of organisms. They have a number
of different roles: structural (such as skin on animals), catalysis (enzymes),
negative catalysis (inhibitors), transportation, regulation, control of genetic
transcription and as participants in immune systems. The main function
of a protein is to bond to different substances. What substance a protein
will bond to depends on the shape of the protein and the distribution of
electrical charges on its surface. What shape a protein will get depends
solely on what amino acids it consists of.
3.3
Pathways
A biochemical pathway (or biochemical network) is a network of interconnected chemical processes within the cell. These are processes between
different molecules, including proteins (protein interactions). There are
different types of pathways. A metabolic pathway is a series of reactions
catalyzed by enzymes. A regulatory pathway is a series of reactions and
interactions regulating the expression and activity of enzymes and transporters. A signal transduction pathway is a series of reactions and interactions realizing the transfer of information between different cellular locations,
such as between the extra-cellular medium and the cell nucleus. The difference between regulatory and signaling pathways is not always clear-cut.
The mapping of pathways is important to attain knowledge on how the
proteins coded by genes actually affect the organism. The analysis is tricky:
the amount of information is large, data in databases are heterogeneous,
data can often be incomplete and the potential size of pathways can be
very large.
3.4
Experimental methods
Protein interaction data are determined mainly by two different methods
[Cun01, Twy03], two-hybrid systems and phage-display systems.
3.4.1
Two-hybrid systems
There are a number of two-hybrid systems with yeast two-hybrid systems
being the most common. The yeast two-hybrid system is a method that
uses a yeast protein called gal4 which functions as a transcription factor.
This factor is split up in two parts, one with the dna-binding domain (dbd)
and the other with the activation domain (ad).
The protein sequence whose function is unknown is joined with the
transcription factor’s dna-binding domain and the result is a hybrid protein
called a bait.
The other genes are joined with the transcription factor’s activation
domain to form hybrid proteins known as prey.
The bait is then tested against each prey. Possible interactions can be
detected since an interaction between the bait and a prey results in a functional transcription factor, which in turn activates a test gene whose activity can be
detected.
3.4.2
Phage-display systems
Another method is phage-display. The steps, as described by [Twy03], are:
1. The function of protein X is unknown. The protein is used to coat
the surface of a small plastic dish.
2. All the other genes in the genome are expressed as fusions with the
coat protein of a bacteriophage (a virus that infects bacteria), so that
they are displayed on the surface of the viral particle.
3. This phage-display library is added to the dish. After a while, the
dish is washed.
4. Phage-displaying proteins that interact with protein X remain attached to the dish, while all others are washed away. DNA extracted
from interacting phage contains the sequences of interacting proteins.
3.4.3
Curated data
These methods are susceptible to false positives; e.g., gal4 can interact
with other yeast cell elements and trigger transcription within the bait
without any prey involved. The data collected in biological databases are
therefore not necessarily correct. Some databases require manual verification of the data before submission, in some cases with references to at least two different
published sources, while other databases do not require
any verification.
IntAct (see Section 3.5.6), for example, allows authors to state their
confidence in an interaction [EMB05], which can be expressed as attributes
of an interaction in the psi mi format (see Section 3.6.2).
3.5
Databases
Researchers use databases to store experimental data using different models
[DGvHW03]. Database models are used simply for storage but are not suitable for analyzing the structure of the stored networks to give a qualitative
analysis; therefore special models adapted to these needs, such
as graph-based models, are designed. These models are often extracted from data stored
in a database according to the traditional database model. Computational
models are used to explain biological systems by simulating the system and
giving a quantitative analysis.
Pathway databases contain data on pathways, the components (enzymes, substrates and products) and the interactions. There are a number
of databases for enzymatic catalysis, protein-dna interactions, metabolic
pathways and protein-protein interactions. Most of the databases are relational databases.
A pathway database typically contains information on substances (proteins or chemical substances) and reactions. Reaction data typically links
to participating substances. The reaction can be a part of a larger entity
such as an experiment or compartment. Data on substances often contains an external reference to one or several well-known databases such as
the UniProt/Swiss-Prot Protein Knowledgebase [swi05] to identify the substance. Interaction data could contain references to literature describing
the experiment, such as an Entrez PubMed id [ent].
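The typical record structure just described can be sketched as two C++ record types. This is our own illustrative model, not the schema of any actual pathway database; all field names are invented.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Illustrative sketch of the typical contents of a pathway database
// entry. The field names are ours and match no particular schema.
struct Substance {
    std::string id;
    std::string name;
    std::string externalRef;  // e.g. a UniProt/Swiss-Prot accession
};

struct Reaction {
    std::string id;
    std::vector<std::string> reactants;      // ids of participating substances
    std::vector<std::string> products;
    std::vector<std::string> literatureRefs; // e.g. PubMed ids for the experiment
};
```

The point of the sketch is the linking: reactions refer to substances by id, and substances carry external references that identify them in well-known databases.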
There exist a number of established databases for protein interactions,
some of which are described here.
3.5.1
KEGG
kegg (Kyoto Encyclopedia of Genes and Genomes) is Kyoto University's collection of databases for genomic information, gene functions
and pathways. The three databases are genes, ligand and pathway
[KG00]. Information can be accessed in a top-down manner (starting from
the pathway) or bottom-up (starting with the genes). genes contains completed genomes for a number of species.
3.5.2
DIP
dip (Database of Interacting Proteins) [dip04] at the University of California combines information from a number of sources into a single
catalog of protein-protein interactions. These interactions have been experimentally determined, and the data has been manually and automatically
curated. Searching is done by first finding the initial protein through a keyword
search; the interaction network can then be explored by following interaction links. dip exports its data in the xin (see Section 3.7.2), flat-file
text and psi mi formats.
3.5.3
MINT
mint (Molecular Interactions Database) [min05] is a database that stores interactions between biological molecules. The interactions are experimentally verified and mostly from mammals. The interactions can be viewed
graphically in a Java applet. Data is available in the psi mi format and in
a flat file format.
3.5.4
BIND
bind (Biomolecular Interaction Network Database) [bin] is a database to
capture information about interactions, molecular complexes and pathways. Interactions can be between proteins, rna, dna, molecular complexes, small molecules, photons and genes. The interactions included are
experimentally verified and published in at least one publication. Interaction information contains, in addition to description of the two biological
objects, a publication reference to PubMed.
An interaction between two or more objects is the basic unit of the
database. Interactions can be linked to form molecular complexes or pathways.
Data is available in a proprietary xml format.
3.5.5
Reactome
Reactome [rea] is a database on biological pathways, mainly human but
there are pathways from other species as well. The data published has
been peer reviewed and has either a literature citation or an electronic
inference. The human pathways can be navigated through a constellation
called starry sky that visualizes the connections between pathways. Data
is available in sbml format (see Section 3.6.1) as well as a database dump
and a Protégé project file.
3.5.6
IntAct
IntAct [int05] is an open source database and toolkit for protein interactions. By releasing it as open source the creators hope that more organizations will use it and thereby provide data using the same data model and
infrastructure. It currently contains nearly 40,000 interactions [Sam04].
IntAct uses controlled vocabularies for describing participating substances.
Data is available in the psi mi format.
3.6
Proposed standard formats
Since there exist a number of different databases that export their data
in different formats, the need for a standard format has been recognized by
several groups. This has resulted in several proposed standards for the
interchange of pathway data. So far, the most supported format seems
to be psi mi.
3.6.1
SBML
sbml (Systems Biology Markup Language) [HFSB03] is a data format for
describing models of biochemical networks. The aim is somewhat different
from that of psi mi and Biopax, which are more concerned with the exchange of
pathway database contents. The format is based on xml and is intended
for exchanging data between different software packages and databases.
sbml is developed in levels, where the lower levels contain features
for basic information and higher levels will add features for more specific
information. Currently level 2 [FH03] has been reached.
A model of a pathway is built using a number of central parts (see
Figure B.1 in the appendices): compartments (a container of finite size),
species (reactants), reactions and events (discontinuous change of variables,
e.g. a change of concentration, triggered by a condition). Other parts are
parameters and functions.
sbml contains a special unit system that allows definition of units, containing information on dimensionality and conversion to other units.
Listing 3.1 shows a simplified example of an sbml document exported
from the Reactome database. The species (different chemical substances)
have a compartment attribute that refers to a compartment specified in
listOfCompartments. A reaction consists of reactants, products and
modifiers that are all specified by speciesReferences to the species list.
Listing 3.1: Simplified Reactome sbml example
<?xml version="1.0" encoding="UTF-8" ?>
<sbml xmlns="http://www.sbml.org/sbml/level2" level="2" version="1">
<model name="REACTOME">
<listOfCompartments>
<compartment name="R_extracellular" id="R_extracellular"/>
<compartment name="R_plasma_membrane" id="R_plasma_membrane"/>
</listOfCompartments>
<listOfSpecies>
<species name="R_109266_5_nucleotidase_ecto_CD73_holoenzyme"
compartment="R_plasma_membrane" id="
R_109266_5_nucleotidase_ecto_CD73_holoenzyme" />
<species name="R_109276_H2O" compartment="R_extracellular" id
="R_109276_H2O" />
<species name="R_109277_orthophosphate" compartment="
R_extracellular" id="R_109277_orthophosphate" />
<species name="R_83908_adenosine" compartment="
R_extracellular" id="R_83908_adenosine" />
<species name="R_109275_adenosine_5_monophosphate"
compartment="R_extracellular" id="
R_109275_adenosine_5_monophosphate" />
</listOfSpecies>
<listOfReactions>
<reaction name="R_adenosine_5_monophosphate_A..." id="R_109278">
<listOfReactants>
<speciesReference species="R_109275_adenosine_5_monoph..." />
<speciesReference species="R_109276_H2O" />
</listOfReactants>
<listOfProducts>
<speciesReference species="R_109277_orthophosphate" />
<speciesReference species="R_83908_adenosine" />
</listOfProducts>
<listOfModifiers>
<modifierSpeciesReference species="
R_109266_5_nucleotidase_ecto..." />
</listOfModifiers>
</reaction>
</listOfReactions>
</model>
</sbml>
3.6.2
PSI MI
psi mi (hupo psi's Molecular Interaction Format) [HUP] is a format from
the Human Proteome Organization's Proteomics Standards Initiative for accessing database content on protein-protein interactions, allowing comparative analysis
regardless of the database used. The standard defines a small data model
for core data with references back to the original data sets for full information (see Listing 3.2 and Figure B.2 in the appendices). Much effort has
been put into making it possible to standardize the contents of attributes.
Controlled vocabularies and ontologies are used [HMPB+04], both external
and some specially created for the psi mi format.
Listing 3.2 shows a simplified example of a psi mi file exported from
the IntAct database. The proteins are described as proteinInteractors
in the interactorList with name and id. The interactions are listed in
the interactionList. The participating proteins are listed as references
(proteinInteractorRef) to the proteins in the interactorList, together
with the role (bait or prey).
Listing 3.2: Simplified IntAct psi mi example
<?xml version="1.0" encoding="UTF-8"?>
<entrySet level="1" version="1" xmlns="net:sf:psidev:mi">
<entry>
<experimentList>
<experimentDescription id="EBI-79223">
<names>
<shortLabel>asahi-1999-1</shortLabel>
<fullName>coip of ATP2A1 (SERCA1) with PLN</fullName>
</names>
</experimentDescription>
</experimentList>
<interactorList>
<proteinInteractor id="EBI-79241">
<names>
<shortLabel>ata1_rabit</shortLabel>
<fullName>Sarcoplasmic/endoplasmic reticulum calcium...</fullName>
</names>
</proteinInteractor>
</interactorList>
<interactionList>
<interaction>
<names>
<shortLabel>pln-atp2a1</shortLabel>
</names>
<experimentList>
<experimentRef ref="EBI-79223"/>
</experimentList>
<participantList>
<proteinParticipant>
<proteinInteractorRef ref="EBI-79241"/>
<role>prey</role>
</proteinParticipant>
<proteinParticipant>
<proteinInteractorRef ref="EBI-79236"/>
<role>bait</role>
</proteinParticipant>
</participantList>
<interactionType>
<names>
<shortLabel>aggregation</shortLabel>
<fullName>aggregation</fullName>
</names>
</interactionType>
</interaction>
</interactionList>
</entry>
</entrySet>
3.6.3
BioPAX
Biopax (Biological Pathway Exchange) is an ontology for sharing pathway
information that is defined in owl (the Web Ontology Language, see Section 2.4.2). Both hupo psi and the group behind sbml are collaborating
with the Biopax group. There is an owl-based file format using the Biopax
ontology available, but a file format could also be specified as an extension
to another format like psi mi, in xml directly or in another format.
At least three levels of Biopax are planned, but so far only level 1
[Wor04] has been released (drafts of level 2 are available). This level supports four different interactions: biochemical reactions, enzyme catalysis,
transport catalysis and complex assembly. The physical entities supported
in the model are small molecules, proteins, rna and complexes. The next level
will add binding interactions, and level 3 will add signaling pathways.
3.7
Proprietary exchange formats
There are a number of proprietary exchange formats each developed with a
special database in mind. These often map directly to the internal database
schema.
3.7.1
KGML
kgml (kegg Markup Language) [Lab] is the exchange format for kegg (see
Section 3.5.1) graph objects. It allows automatic drawing of pathways and
computational analysis of protein and chemical networks [Lab]. The format
contains information about the graphical representation of the pathways.
See Figure B.3 in the appendices.
3.7.2
XIN
xin (Extensible Interaction Network) [oC] is an exchange format for dip
(see Section 3.5.2). A xin file consists of three sections: attributes, nodes
and edges. The attributes section contains attributes assigned to all nodes
and edges, either as default values or referenced from individual attribute
entries in each node or graph. To each individual node or edge a number
of feature entries can be added; these are used to specify cross-references
to other protein databases and to list references to experimental data.
3.7.3
BIND
bind uses Abstract Syntax Notation One (asn.1) as its native data description language. asn.1 is a language from the telecommunications world for
describing messages to be interchanged between applications [asn04]. Due
to the complexity and pricing of tools it has never really caught on at
application level [Kay03]. Data in asn.1 are strongly typed and have binary encoding. Both are features not found in xml (though typing can
be achieved by using xml Schema and binary encoding schemes for xml
are under development [xml05]). Because of the wide-spread support for
xml, bind also offers the data in xml format, but due to the amount of
detail of information about interacting molecules and the lack of binary
encoding the file sizes are large. The asn.1 data is mapped to xml using
the ncbi (National Center for Biotechnology Information) programming
toolkit [BBH03][NCB03].
Data is structured around interaction pairs; data are often repeated for
each entry and cross-referencing is used sparingly [Str03]. The deep nesting
of elements, around 20 levels at its deepest, makes the format hard for
humans to read.
3.8
Summary
Proteins are long polymers of amino acids that are important for the life
of organisms: they build structures (such as skin), control chemical reactions, regulate genetic transcription and so on. About 15,000 protein structures are currently known.
A network of interconnected chemical processes within a cell is called a
biochemical pathway or biochemical network. Reactions that include proteins are called protein interactions.
Mapping of pathways is important to be able to understand how proteins affect the organism. The amount of information, the heterogeneity
and incompleteness of data make this a demanding process.
Protein interactions are determined mainly by two different methods,
two-hybrid and phage-display systems. Both methods are susceptible to
false positives and data collected is therefore not necessarily correct. The
experimental data is stored in databases. There exist a large number of
databases that store this kind of data. The data can simply be the result of
experiments, experimental data curated to avoid errors or manually built
models of pathways. Some of the databases have their own xml-based
export format but there also exists a number of proposals for a standardized
xml-based format for exchange of protein-interaction data. psi mi seems
to be the most supported of those formats at the moment, while Biopax is
used very little. sbml has a different focus since it is meant for expressing
models of biochemical networks rather than just experimental data.
Chapter 4
Problem analysis
This chapter presents what aspects we investigated, which data sets we
used and what queries we performed. The selection of technologies is also
motivated.
4.1
Questions to be answered
We wanted to investigate the possibility of using xml tools for search and
discovery in protein-interaction data. We also wanted to test expressibility
and performance, compare capacity and performance of different tools and
compare different proposed standard formats for protein-interaction data.
To be able to do this we needed relevant queries to perform.
These queries had to be selected both to extract information of interest from a bioinformatics point of view and to test different aspects of the capabilities
and performance of the different xml tools.
4.1.1
Query capability
We were to investigate the expressibility of the selected xml query language: what queries is it possible to express, and how complicated
will they be? We also wanted to test the compliance to the query language of
different xml tools.
The queries were to involve some of the challenges for xml queries
presented in [SWK+01]: path traversals, references, joins and construction
of large results. In addition the queries also had to cover the following
characteristics as presented in [YzK04]: exact match, function application
(aggregate functions) and quantification.
4.1.2
Efficiency
The data files are huge, resulting in long processing times and pushing the
limits of available internal memory. We wanted to investigate the efficiency
of the tools when it comes to speed and memory usage.
Speed xml files lack information about where in the file a specific element can
be found, or even where the next element starts. This means that
the file must be parsed byte by byte from the beginning to find a
specific tag matching the given query. Given the large file sizes, processing will probably be very slow unless the data is stored in tree
form in internal memory or an index is used. Due to
the structure of xml, traversing the file to find sibling, child or parent
nodes may be expensive if an exhaustive search is to be done on a
large subtree.
Memory usage The size of the files also means that it is not certain that
entire xml files can be parsed into dom trees for fast and effective
processing of subsequent queries, since these trees are stored in internal memory.
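To make the speed concern concrete, the following deliberately naive counter scans the document from the start for every query, which is essentially what a tool without indexes or an in-memory tree must do. This is our own sketch, not thesis code; a real parser would also have to handle comments, CDATA sections and quoted attribute values.

```cpp
#include <cassert>
#include <string>

// Deliberately naive illustration of why plain xml files are slow to
// query: without an index, every byte must be scanned from the start
// to find the elements of interest. Time grows linearly with file size
// no matter how few matches there are.
int countElements(const std::string& xml, const std::string& name) {
    const std::string open = "<" + name;
    int count = 0;
    for (std::string::size_type pos = xml.find(open);
         pos != std::string::npos;
         pos = xml.find(open, pos + open.size())) {
        // Accept only exact tag names: "<species " but not "<speciesReference".
        char next = pos + open.size() < xml.size() ? xml[pos + open.size()] : '\0';
        if (next == ' ' || next == '>' || next == '/')
            ++count;
    }
    return count;
}
```

An indexed database avoids this by recording where elements of each name occur, so a query touches only the relevant parts of the data.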
4.2
Chosen datasets
In order to evaluate the xml tools we needed to select what databases and
datasets to use and what queries to perform.
4.2.1
Databases
psi mi seems to be the most supported format for protein interaction data.
It is available as an alternative download format in a number of databases,
so the availability of pathway information stored in the psi mi format is relatively
high.
The availability of sbml data was lower than for psi mi on the databases
we examined. The most interesting site offering sbml data that we found was
Reactome (see Section 3.5.5).
The two formats chosen for testing were psi mi and sbml. For the
psi mi format the dataset for Drosophila melanogaster from the IntAct
database was selected. The dataset consisted of nine files with a total
size of 29.3 MiB. The sbml data source chosen was Reactome's dataset for
human reactions, available in sbml level 2, version 1. This file was 3 MiB
in size.
Since it is not possible to make out reactants and products in psi mi
interactions, we decided to concentrate the pathway-searching efforts on
the Reactome dataset, where it was easier to understand the biochemical
processes. It is, however, still useful to perform pathway searches on
the psi mi dataset as well.
4.2.2
Queries
From a bioinformatics point of view we wanted to know a number of things.
The most central, and most difficult to answer, was to find a pathway between two different proteins. Other things we wanted to know were whether there
existed specific substances and interactions, what interactions a given protein was part of, and whether two given proteins took part in the same interaction.
From a biological standpoint the size of the datasets is perhaps not so
interesting, but it serves as a simple baseline for understanding the performance
of the other queries.
These were the selected queries for IntAct/psi mi:
1. Find information on a given protein. Protein id is given.
2. Find information on a given experiment description. Experiment
description id is given.
3. Find information on a given interaction. Interaction id is given.
4. Find the protein information for the proteins that are part of a given
interaction. Interaction id is given.
5. Find the experiment description information for an experiment description that is part of an interaction. Interaction id is given.
6. Find all interactions that a given protein is a part of. Protein id is
given.
7. Find all interactions that a given experiment description is part of.
Experiment description id is given.
8. Find any interactions that two given proteins are both a part of. Protein
ids for the two proteins are given.
9. Count the number of proteins in the database.
10. Count the number of interactions in the database.
The selected queries for Reactome/sbml were:
1. Find all information on a given compartment. Compartment id is
given.
2. Find all information on a given species. Species id is given.
3. Find all information on a given reaction. Reaction id is given.
4. Given a reaction id, find all information on those reactions that have
a protein listed as a reactant that is also listed as a product in the given
reaction.
5. Given two proteins, find out if there is a pathway between them.
Protein ids are given for the two proteins.
6. Count the number of species in the database.
7. Count the number of reactions in the database.
4.3
Chosen technologies
Two different approaches were chosen. The first was the construction of a
platform consisting of Java programs connecting to a native xml database
using XQuery as the query language, see Section 4.3.1. Since we suspected that
the pathway queries would take quite a long time to perform, and these queries
were important, a second approach was taken especially for them,
using a C++ program and a graph toolkit, see Section 4.3.2.
4.3.1
Native XML databases and XQuery
A number of query languages for querying xml data have emerged. There
is, however, currently only one widely supported and actively developed
language, XQuery. A future standardization will probably result in XQuery
becoming even more common.
The problem with processing speed and memory constraints made it
impossible to load entire files into memory. To overcome this, the use of a
system using indexing and allowing processing of files directly on disk was
essential. A native xml database supports this as well as xml queries.
Different nxd implementations with XQuery support, together with a
stand-alone XQuery implementation, were to be compared for speed and
memory usage.
- Exist
- Sedna
- X-Hive db
- Qizx/open
The stand-alone XQuery implementation without indexing, Qizx/open, was
included for comparison, to see if there were any real benefits with a native
xml database.
Almost all nxds have some sort of Java api, while support for other
languages is sparse. Java Servlets offer an easy way to provide web
connectivity. Java was also one of the languages we were most familiar
with. Java programs can be run on different platforms without
recompilation. However, this is not much of an advantage if a web
interface is used.
4.3.2
The Graph Template Library
The gtl (Graph Template Library) [FFRB03] is a graph and graph-algorithm extension of the Standard Template Library in C++. It was chosen
because it is widespread and has a free licence for non-commercial use.
The graphs can be directed or undirected, and there are a number of algorithms for finding shortest paths, as well as algorithms for testing planarity
and finding maximum flows and minimum spanning trees.
There is, however, no algorithm for finding all possible paths between
two nodes. This had to be implemented in the program.
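The missing all-paths search can be implemented as a backtracking depth-first search that enumerates every simple path. The sketch below is our own, using a plain STL adjacency list instead of the gtl graph type (whose interface may differ), but it illustrates the kind of algorithm the program had to provide.

```cpp
#include <cassert>
#include <vector>

// Enumerate all simple paths between two nodes by depth-first search
// with backtracking. A minimal sketch on a plain STL adjacency list;
// the actual thesis program used the gtl graph type instead.
using Graph = std::vector<std::vector<int>>;  // node -> list of neighbours

static void dfs(const Graph& g, int node, int target,
                std::vector<bool>& visited, std::vector<int>& path,
                std::vector<std::vector<int>>& result) {
    visited[node] = true;
    path.push_back(node);
    if (node == target) {
        result.push_back(path);          // found one complete path
    } else {
        for (int next : g[node])
            if (!visited[next])          // "simple": never revisit a node
                dfs(g, next, target, visited, path, result);
    }
    path.pop_back();                     // backtrack
    visited[node] = false;
}

std::vector<std::vector<int>> allPaths(const Graph& g, int from, int to) {
    std::vector<bool> visited(g.size(), false);
    std::vector<int> path;
    std::vector<std::vector<int>> result;
    dfs(g, from, to, visited, path, result);
    return result;
}
```

Note that the number of simple paths can grow exponentially with graph size, which is one reason pathway queries were expected to be slow.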
Graphs can be built within the program using functions provided by
the gtl. They can also be imported from file using the provided file reader
function. The graphs are defined with gml, Graph Markup Language, a
non-xml format for graphs. A graph is defined by describing all nodes and
then the edges between nodes.
Listing 4.1: gml example [Him96]
graph [
comment "This is a sample graph"
directed 1
id 42
label "Hello, I am a graph"
node [
id 1
label "Node 1"
thisIsASampleAttribute 42
]
node [
id 2
label "node 2"
thisIsASampleAttribute 43
]
node [
id 3
label "node 3"
thisIsASampleAttribute 44
]
edge [
source 1
target 2
label "Edge from node 1 to node 2"
]
edge [
source 2
target 3
label "Edge from node 2 to node 3"
]
edge [
source 3
target 1
label "Edge from node 3 to node 1"
]
]
Chapter 5
Native XML database
setup
The test was performed to analyze the capabilities and performance of
different xml query systems. The following database systems and XQuery
processors were tested:
- Exist
- X-Hive
- Sedna
- Qizx/open
To get a reasonably fair comparison, all products were tested on the same
machine running Microsoft Windows XP (specifications in Section 5.1.6).
The initial tests were carried out both by using the gui (if available) and
by using each product's Java api. Further testing was done only against the
Java apis, to be able to test efficiently and get correct query time results.
The test consisted of queries in the XPath and XQuery languages.
5.1
Native XML databases
This is a description of the logical structure and indexing in the different
nxd implementations. They have different approaches that affect how the
user accesses different documents and the performance of queries.
5.1.1
Exist
The Exist [exi05] database consists of collections and documents. A collection can in turn contain collections and documents. Exist can, unlike
X-Hive and Sedna, only have one database per installation.
Exist has three types of indexing [exia]: full-text, structural and range
indexing. The structural index is always on and is used in almost all XPath
and XQuery queries.
The full-text indexing is on by default for all text nodes. To use the
full-text index the queries must be rewritten to use the special operators
&= and |= or special functions like near() and match-all().
The range index is an index that can be defined by connecting a data
type to an element (a node in the xml tree). The index will then be used
in expressions that compare the element with values of the same data type.
The XQueries were rewritten to make use of the full-text index.
5.1.2
Sedna
In Sedna [sed05], xml data is stored in documents. These documents can
be stored stand-alone in the database or in a collection [MOD04]. Queries
can be made against a specific document, a collection, or a specific document
within a specific collection.
Sedna makes use of a descriptive schema [GFK04] to store information about the structure of the documents in the database. Unlike prescriptive
schemas (like dtd and xml Schema, see Section 2.2), descriptive schemas
are generated dynamically when the documents are loaded into the database. The descriptive schema is then used as an index structure.
5.1.3
X-Hive
The logical data model in an X-Hive [xhi05] database consists of a root
library containing other libraries. Libraries can in turn contain other libraries, documents (xml data) and catalogs (containing dtds and xml
Schemas).
There are a number of different settings for indexing in X-Hive. The
library containing xml files has a library name index by default. The
root library also has a library id index. Indices that can be added are id
attribute indices, element name indices and value indices (by element value
or attribute value).
5.1.4
Qizx/open
Qizx/open [Fra05b] is an XQuery implementation that works directly against
xml stored as ordinary files on disk or data stored in sql databases. The data
is not read until the query is executed; there are no special storage formats
or indices, which obviously makes queries slower than in nxds.
5.1.5
Java
Java is used in all tested XQuery implementations in one way or another.
X-Hive and Exist use Java both on the server and client sides. Qizx/open
is fully implemented in Java. The Sedna api used is also Java-based; the
server part, however, is compiled C code.
The heap and stack sizes were not altered in any way from the default
setup.
5.1.6
Machine setup
The tests were done on an ibm X40 with an Intel Pentium 1.2 GHz processor
and 512 MiB of internal memory running Microsoft Windows XP. Windows
was chosen since compiled Sedna binaries for Microsoft Windows were available.
5.2
Queries
This section describes the queries used in the tests.
5.2.1
Type of queries and efficiency
Most of the queries were written as simple path expressions with arbitrary
depth (nesting) using // and some conditional. One problem with using
an arbitrary depth location step operator was that searches took longer,
since more of the tree had to be traversed. Since the structure of the
documents is fixed it would be possible to rewrite the queries to use the
full path. It was, however, not certain that the query engine would benefit from
this. If an index is used it can be faster to retrieve an element at arbitrary
depth than to specify the full path, since all steps must then be verified for a
match [X-H04]. Another possible optimization could have been to limit
the number of returned entries when the expected number of entries is
one, as it is for entries with unique ids. If there is only one entry the
XQuery engine does not have to search the rest of the document once the
data searched for is found.
flwor queries were used for complex queries that were easier to grasp
in flwor form.
5.2.2
Description of queries
Simple path expressions are used in most of the queries. Listing 5.1
shows an example of a simple query (IntAct 1). The first line is the
namespace declaration, binding p to the uri
“net:sf:psidev:mi”. The next line is an XPath expression for a proteinInteractor
element with the specified id attribute at any depth in the rat_small document. A textual description of the queries can be found in Section 4.2.2
and the queries expressed in XQuery can be found in Appendix C.
Listing 5.1: Query example
declare namespace p = "net:sf:psidev:mi";
document("/psimi/intact/rat_small.xml")//p:proteinInteractor[
@id="EBI-77471"]
IntAct 1-3 and Reactome 1-3 are very basic queries. They are path expressions (as defined in [YzK04]) with one conditional on an attribute (IntAct
1-2, Reactome 1-3) or on character data (IntAct 3).
IntAct 6-7 are path expressions with several levels in the conditional.
IntAct 8 is a path expression with an and query and several levels in the
conditional, see Listing 5.2.
Listing 5.2: Query with several levels in conditional
declare namespace p = "net:sf:psidev:mi";
document("/psimi/intact/rat_small.xml")//p:interaction[p:
participantList/p:proteinParticipant/p:
proteinInteractorRef/@ref="EBI-77460" and p:
participantList/p:proteinParticipant/p:
proteinInteractorRef/@ref="EBI-77480"]
IntAct 9-10 and Reactome 6-7 are simple queries using the count aggregate function. These are used to give a measure of the size of the datasets.
IntAct 4 uses joins to get information on proteins in an interaction.
IntAct 5 uses joins to get information on what experiment an interaction
is part of. While query 4 uses a flwor expression (see Listing 5.3), query
5 uses a nested XPath expression.
Listing 5.3: Join using flwor expression
declare namespace p = "net:sf:psidev:mi";
for $ref in document("/psimi/intact/rat_small.xml")//p:
interaction[p:names/p:shortLabel="dlgap1-dlg1-1"]/p:
participantList/p:proteinParticipant/p:
proteinInteractorRef/@ref
return document("/psimi/intact/rat_small.xml")//p:
proteinInteractor[@id=$ref]
Reactome query 4 contains a recursive function findMolecule that returns elements for found connected proteins as well as information on when
the maximum depth is reached. The elements make it easier to automatically
evaluate the results. The function takes a reactant id and a cut-off depth
as parameters.
Reactome query 5 is a continuation of query 4. There are two versions
of this query: one basic, without filtering of the results, and one more advanced, with filtering. The most important part is the recursive function
findMolecule that, in addition to the start reactant and cut-off depth, also
takes the goal reactant as a parameter, see Listing 5.4. Until the goal molecule
is found or the cut-off depth is reached it will continue to call itself with a
new start reactant. The function also performs a simple loop detection; it
will not try to check a reactant with the same id as the one it is currently at.
Listing 5.4: findMolecule recursive function
declare function local:findMolecule($molecule as xs:string,
    $goalMol as xs:string, $n as xs:integer) {
for $i in document("/sbml/reactome/sbml.xml")//s:reaction[s:
listOfReactants/s:speciesReference/@species = $molecule
]/s:listOfProducts/s:speciesReference/@species
return <item reaction="{$i/parent::node()/parent::node()/
parent::node()/@id}">{$i} {
if($i = $goalMol)
then <found/>
else if($i = $molecule)
then <loop />
else if ($n = 1)
then <max/>
else local:findMolecule($i, $goalMol, $n - 1)}
</item>
};
The filtering part uses the xml data returned by findMolecule to list
only paths that actually reach the goal reactant. A flwor statement and
a recursive function, getParent, are used to filter the data.
First the <found/> elements are located in the findMolecule output,
then the tree is traversed upwards and the path is returned as a tree. The
benefit of this filtering is not just a more intelligible result but also
faster queries, since no time has to be spent on serializing xml results for
paths that do not reach the goal reactant. This filtering does not work in
Qizx/open; the query used there does not include the post-filtering part.
5.2.3
XML serialization
Most queries produce xml output. This data has to be constructed into a
tree, or serialized, and that is a time-expensive operation. This process is
a factor in operations like reconstruction (round-tripping) and construction
of large results. xml construction takes increasingly longer with an
increased number of nested elements.
5.3
Test framework
All three nxds were connected to via Java apis. Since all queries were to be
processed by all database systems, a common framework was developed to
gather operations needed when interacting with all of the database systems.
These operations were:
Input parameters Reading of input parameters from the command line.
These parameters were the name of the database, the file to import,
the proteins to search and so on.
Performing queries The queries were stored as methods which took a
number of parameters and then called the database specific query
method.
Iterations The queries were iterated a number of times, in this case ten,
to reduce the effect of random spikes in query times caused by the
operating system and other processes on the test system.
Time calculations The minimum, maximum and mean values for the
search times for each query were calculated.
Presentation The test results were printed out with name, the calculated
values and all search times.
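The time calculations above can be sketched as follows. The framework itself was written in Java; this C++ sketch only illustrates the statistics computed, and the type and function names are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <numeric>
#include <vector>

// Summary statistics computed for each query's repeated timings.
struct TimingStats { double min; double max; double mean; };

// Compute minimum, maximum and mean over the measured times (in
// milliseconds) for the iterations of one query. Assumes a non-empty vector.
TimingStats summarize(const std::vector<double>& times) {
    TimingStats s{};
    s.min = *std::min_element(times.begin(), times.end());
    s.max = *std::max_element(times.begin(), times.end());
    s.mean = std::accumulate(times.begin(), times.end(), 0.0) / times.size();
    return s;
}
```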
Parts of the program that were different from database to database were:
Database connection All the different systems had different methods to
connect to the database system. Some used a connection via their
own api, while others use an xml-rpc connection.
Data loading Some of the database systems took the file name as an
input parameter to load the file as a document into the database,
others received the file contents over the established connection.
Query sending The query sending method involved taking the query
passed as a parameter from the query method and passing it on
to the database system. This method measured the time elapsed between the sending of the query and the receipt of the results. The
method made sure that data was returned from the database system
and then returned the measured time.
Document dropping Dropping a document from the database was done in different ways in the different tools. X-Hive had a special api function
while Sedna received the command in the same way as XQuery commands.
Connection closing This was where the connection to the database system was closed; this too is highly specific to each database
system.
5.4
Benchmarking
For Exist, X-Hive and Sedna the provided Java apis were used to connect
to the database and issue queries. Since we did not get Qizx/open’s Java
api to work we used the evaluation time returned by the command line
tool. It soon became obvious that the timing resolution provided by Java
under Microsoft Windows was not especially good, somewhere between 15 and 16 milliseconds, which was also indicated by other sources [Rou02, Man01].
Both propose using a dll (Dynamic Link Library) combined with jni
[jni04] (Java Native Interface) to access the cpu clock, which gives better
resolution than Java’s System.currentTimeMillis() call.
Roubtsov’s hrtlib [Rou02] was used for measuring query times on
Exist, X-Hive and Sedna. Since Qizx/open’s command line tool was used,
and it seems to use a System.currentTimeMillis() call internally, the
measured query times for that tool had poor resolution.
Since Exist, X-Hive and Sedna all use a socket-based connection to
reach the XQuery engine from the api, the exact times could depend on
external factors such as other processes running on the system. Despite
efforts to minimize the number of other running processes on the computer,
results would differ from time to time for the same queries. To decrease the
influence of sudden spikes in measured time, all queries were run ten times.
The numbers given in the tables in Sections 7.1 and 7.2 are the mean values
of ten iterations.
Qizx/open does not use sockets and was not susceptible to delays in
the same way; together with the decreased resolution this means the measured
number will almost always be the same given an even load on the system. The
numbers for Qizx/open are based on a mean value of three iterations.
Chapter 6
GTL test setup
This chapter describes the graph toolkit, the work to build a pathway search application using it, and how to process sbml data using the
application.
6.1
The GTL package
gtl [FFRB03] is a C++ library for graphs and graph algorithms. It contains the classes needed to work with graphs, nodes and edges and a few algorithms, such as breadth-first and depth-first search, Dijkstra’s algorithm and
algorithms for finding the minimal spanning tree of a graph. There are
constructs such as node and edge iterators, much like the iterators in C++’s
Standard Template Library (stl).
Since there was no algorithm in the gtl package for finding all possible
paths between two nodes, this had to be implemented using the existing
graph, node and edge classes as building blocks.
6.2
Transformation
The protein interaction data had to be converted into graphs before import; this was done by transforming from the initial xml format to the
gml format.

Figure 6.1: Conversion of one reaction to a number of edges

Protein interaction data in sbml format was transformed to
the gml format using an xsl transformation (see Algorithm C.1 in the
appendices). Since only numerical node ids are allowed in gml and the
ids in the Reactome sbml files are alphanumerical (see Listing 3.1) the
transformation must extract the numerical part (the part between the first
and second underscore) of the id.
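The extraction step can be sketched as follows. The thesis does this inside the xsl transformation; this C++ helper is only an illustration, the function name is hypothetical, and the id shape is assumed from the description above.

```cpp
#include <cassert>
#include <string>

// Extract the numerical part of a Reactome id: the substring between the
// first and second underscore (assumed id shape, e.g. "species_12345_name").
std::string numericId(const std::string& id) {
    std::string::size_type first = id.find('_');
    if (first == std::string::npos) return id;       // no underscore: keep as-is
    std::string::size_type second = id.find('_', first + 1);
    if (second == std::string::npos) return id.substr(first + 1);
    return id.substr(first + 1, second - first - 1);
}
```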
The graph was modeled to have substances as nodes and reactions as
edges. Since a reaction typically involves at least two reactants and can result in more than one product several edges were created for every reaction
(see Figure 6.1). Modifier substances were not included in the graph.
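The reaction-to-edges expansion described above can be sketched like this; it is illustrative only, with node ids assumed to be the extracted numeric ids and the function name hypothetical.

```cpp
#include <cassert>
#include <utility>
#include <vector>

// Turn one reaction into graph edges: every (reactant, product) pair
// becomes a directed edge between substance nodes, as in Figure 6.1.
std::vector<std::pair<int, int>> reactionEdges(const std::vector<int>& reactants,
                                               const std::vector<int>& products) {
    std::vector<std::pair<int, int>> edges;
    for (int r : reactants)
        for (int p : products)
            edges.emplace_back(r, p);   // edge r -> p for this reaction
    return edges;
}
```

A reaction with two reactants and two products thus contributes four edges.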
The xsl transformation was done using the msxsl command line transformation utility [Kim01] from Microsoft and it took about 2 seconds to
transform the 3 MiB large Reactome sbml file on the system described in
Section 5.1.6.
6.3
The program
Before the pathway search is performed, a number of steps are carried out to ensure that there is a path between the two nodes and to simplify
the graph as much as possible.
6.3.1
Removal of extraneous edges
After the gml file is read into the C++ program the graph is traversed.
Non-connected nodes are deleted from the graph and the graph is checked
for the existence of the start and end nodes. If they are not found the program
exits.
Additional edges in the same direction between the same two nodes are
removed if the direction of the pathway is to be taken into consideration,
otherwise all additional edges between two nodes are removed regardless
of direction. These extraneous edges exist due to how the transformation
from the sbml data is done; two nodes can participate as reactant
and product in several reactions.
The reason for removing all extraneous edges is to avoid duplicates in
the result and to avoid the unnecessary time it would take to search paths
including these edges. Finding and removing these additional edges proved to
be one of the most difficult parts of developing this program.
For every node all edges are added to a list that is checked for uniqueness using the stl list unique function and a specially written comparison
function that compares target ids for edges. If there are duplicates the
sorted list is iterated and duplicate entries added to a list for later deletion.
6.3.2
Control of reachability and leaf deletion
A search using gtl’s built-in breadth-first search algorithm is used to verify
that the end node really can be reached from the start node. The information on node reachability gained with the breadth-first search algorithm is
also used to find out what nodes to check for leaf deletion.
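A minimal sketch of such a reachability check follows. The thesis program uses GTL's built-in breadth-first search; the adjacency-map representation and the function name here are illustrative only.

```cpp
#include <cassert>
#include <map>
#include <queue>
#include <set>
#include <vector>

// Breadth-first reachability: can `end` be reached from `start` by
// following directed edges?
bool reachable(const std::map<int, std::vector<int>>& adj, int start, int end) {
    std::set<int> seen{start};
    std::queue<int> q;
    q.push(start);
    while (!q.empty()) {
        int node = q.front();
        q.pop();
        if (node == end) return true;
        auto it = adj.find(node);
        if (it == adj.end()) continue;          // node has no outgoing edges
        for (int next : it->second)
            if (seen.insert(next).second)       // enqueue each node only once
                q.push(next);
    }
    return false;
}
```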
Leaves are nodes that are only connected to one other node. Since a path
through a leaf can never reach the end node (unless the leaf is the end node
itself), all leaves that are not the start (S) or end (E) node can be deleted from
the graph, see Figure 6.2. This optimization of the graph could lead to a
decrease in search times. The leaf deletion was, together with the removal
of extraneous edges, the most difficult part to develop, mostly because you have
to keep track of when you can delete an edge without the risk of referencing a
non-existent edge.
The function also contains code to identify cases where two nodes are
connected by two edges going in opposite directions. This case only matters,
however, when non-directional searches are done, which is not the case with
the search in the sbml data.

Figure 6.2: Node deletion
The leaf deletion is run several times, since long chains of singly-connected
nodes (see the left-most nodes in Figure 6.2) can connect to the graph without
leading to the end node. The deletion is looped until the number of
deleted edges in the current iteration reaches zero.
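The looped leaf deletion can be sketched like this. This is an illustrative reconstruction on a plain undirected adjacency map, not the GTL-based code, and the function name is hypothetical.

```cpp
#include <cassert>
#include <map>
#include <set>

// Repeatedly delete leaves (degree <= 1 nodes) that are neither the start
// nor the end node, until a pass removes nothing. Returns the number of
// deleted nodes.
int pruneLeaves(std::map<int, std::set<int>>& adj, int start, int end) {
    int removed = 0;
    bool changed = true;
    while (changed) {                            // loop until a pass deletes nothing
        changed = false;
        for (auto it = adj.begin(); it != adj.end(); ) {
            int node = it->first;
            if (node != start && node != end && it->second.size() <= 1) {
                // Detach the leaf from its neighbour, then erase it.
                for (int nb : it->second) {
                    auto f = adj.find(nb);
                    if (f != adj.end()) f->second.erase(node);
                }
                it = adj.erase(it);
                ++removed;
                changed = true;
            } else {
                ++it;
            }
        }
    }
    return removed;
}
```

Deleting one leaf can expose a new leaf, which is why the outer loop reruns until nothing more can be removed.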
6.3.3
Path search
The search between the start and end nodes is done by a recursive function, findAllPaths, that works outwards from the start node and follows
outgoing edges. If the program is instructed to do non-directional searches,
incoming edges are also followed.
For every new node the function is called with the graph as one of
the arguments. Another argument is a cut-off depth at which to stop
searching, as with the corresponding XQuery. This is because the run time
would otherwise be very long due to the number of possible paths to
be searched. The graph sent as an argument has its already visited nodes
marked as hidden to avoid loops. The function returns a list containing all pathways
in the form of protein ids along each path. To extract the pathways from
this list a search for the end node is done.
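The core of such an all-paths search can be sketched as follows. This is an illustrative reconstruction: the actual program works on GTL graphs and hides visited nodes instead of keeping a separate set, and all names here are hypothetical.

```cpp
#include <cassert>
#include <map>
#include <set>
#include <vector>

// Recursively enumerate all simple paths from `node` to `goal`, following
// outgoing edges only, with at most `depth` further steps (the cut-off).
void findAllPaths(const std::map<int, std::vector<int>>& adj,
                  int node, int goal, int depth,
                  std::set<int>& visited, std::vector<int>& path,
                  std::vector<std::vector<int>>& results) {
    if (node == goal) { results.push_back(path); return; }
    if (depth == 0) return;                      // cut-off depth reached
    visited.insert(node);                        // "hide" node to avoid loops
    auto it = adj.find(node);
    if (it != adj.end()) {
        for (int next : it->second) {
            if (visited.count(next)) continue;   // loop detection
            path.push_back(next);
            findAllPaths(adj, next, goal, depth - 1, visited, path, results);
            path.pop_back();
        }
    }
    visited.erase(node);                         // unhide for other branches
}
```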
6.4
Benchmark methods
The often-used C call clock() gives imprecise results. The C++ program
instead used the Microsoft Windows QueryPerformanceCounter [Mic04] and QueryPerformanceFrequency calls to get high-resolution timing with microsecond accuracy [Kra]. The numbers given in Table 7.4 are based on a
mean value of ten iterations.
Chapter 7
Results
Since the queries for the IntAct and Reactome databases differ quite a
bit, the results are presented in two separate sections. The section on Reactome covers the native xml database approach as well as the C++
approach. This chapter also contains information on differences between
the XQuery implementations.
7.1
Queries on IntAct data
The results presented in Table 7.1 were achieved when making queries on
data from the drome giot data set. The nine files were combined into one
file of size 29.3 MiB using an xsl transformation. All values for queries
except those for Qizx/open are mean values over ten runs. The values for
Qizx/open are mean values over three runs. The load and drop times for
the xml databases were measured once.
The search times for Sedna were unevenly distributed; the first search
of each type typically took somewhere between 5 and 30 times longer than
the following searches. The column “worst case” shows the iteration of the
query that took the longest time. This indicates that queries are cached.
Similar behavior was not found for the other tools.
Table 7.1: IntAct queries

Query   Exist        X-Hive       Sedna mean   Sedna worst case   Qizx/open
1.      510 ms       2 537 ms     68 ms        102 ms             6 224 ms
2.      329 ms       5 ms         5 ms         6 ms               6 182 ms
3.      172 ms       1 291 ms     127 ms       589 ms             6 339 ms
4.      782 ms       1 302 ms     182 ms       1 119 ms           16 ms
5.      602 ms       3 153 ms     61 ms        63 ms              20 115 ms
6.      739 ms       223 ms       153 ms       1 388 ms           6 489 ms
7.      2 092 ms     60 ms        20 ms        148 ms             6 459 ms
8.      1 308 ms     1 758 ms     208 ms       927 ms             6 443 ms
9.      24 ms        2 968 ms     12 ms        58 ms              4 995 ms
10.     73 ms        2 945 ms     8 ms         9 ms               4 995 ms
load    167 583 ms   23 472 ms    23 054 ms    -                  -
drop    14 951 ms    176 ms       8 829 ms     -                  -

Sedna is the system that had the best performance for all queries except
for query 4 (Protein interaction info) where Qizx/open surprisingly was the
fastest.
X-Hive had remarkably high query times for many of the queries; it did
not seem to cope well with the large amount of data. The long query times for
the count functions (queries 9 and 10) compared with those for Sedna and
Exist may be an indication of different storage strategies for this type of
data.
The long search times for queries 4 (Protein interaction info) and 5 (Experiment interaction info) on X-Hive are probably due to the fact that these
queries make use of element data, whereas the other queries search
on id attributes. An index for element values can be built, but the query must
then be written in a special way to make use of the index, which we did not
manage to do in this case despite several attempts.
The query times for Exist were all over 100 ms except those for the
aggregation functions in query 9 and 10. This could be a result of a special
way to answer those queries using some index or that no xml serialization
was needed.

Table 7.2: Reactome queries

Query                 Exist      X-Hive     Sedna      Qizx/open
1. Compartment info   29 ms      6 ms       7 ms       838 ms
2. Species info       22 ms      2 ms       6 ms       875 ms
3. Reaction info      23 ms      8 ms       26 ms      859 ms
4. Outgoing way       See Table 7.3
5. Pathway search     See Table 7.4
6. Count species      11 ms      71 ms      13 ms      724 ms
7. Count reactions    20 ms      72 ms      5 ms       719 ms
load                  9 908 ms   2 143 ms   1 541 ms   -
drop                  1 581 ms   4 ms       980 ms     -
Sedna and X-Hive both had loading times of about 23 seconds, while X-Hive
was significantly faster at dropping the document. The loading time for
Exist was very long. These two operations are probably infrequent and
would not matter much in practice. There was nevertheless a time benefit from loading the
document into an xml database compared to using Qizx/open, even when
performing only four queries.
7.2
Queries on Reactome data
The results in Table 7.2 were achieved when making queries on the Reactome sbml data set, 3 MiB in size. All time values for queries
except those for Qizx/open are mean values over ten runs. The values for
Qizx/open are mean values over three runs. The load and drop times for
the xml databases were measured once.
Unlike when searching the IntAct data, the search times for Sedna were
randomly distributed. This could be a result of the smaller size of the data
not triggering any caching.
For this smaller data set Sedna was the fastest database system; the
difference to the other systems was, however, not as large as for the IntAct
data set.

Table 7.3: Query 4. Outgoing way searches

Steps   Exist       X-Hive       Sedna      Qizx/open
1       212 ms      132 ms       120 ms     906 ms
2       806 ms      428 ms       157 ms     1 125 ms
3       29 939 ms   9 094 ms     3 044 ms   7 172 ms
4       -           732 885 ms   -          -

Sedna was slower on query 3 than both X-Hive and Exist; it is
unclear why, since query 3 is similar to queries 1 and 2.
Unlike for the large IntAct data set, X-Hive had very good search times
for the Reactome data set and was on par with Sedna or even faster. Again,
X-Hive performed worse on the count aggregation function tests than both Sedna
and Exist.
Exist’s performance compared to the other systems was significantly
better than for the large IntAct data set. The query times were about four
times longer than for Sedna and X-Hive.
Table 7.3 shows the query times for an increasing number of steps for
query 4 in Section 4.2.2. Sedna crashed at 4 steps, probably due to lack of
memory. Qizx/open ran out of memory at 4 steps and Exist halted because
of its memory settings.
X-Hive was unarguably more stable than Sedna, but its query times
grew faster.
Table 7.4 shows the query times for an increasing number of steps for
query 5 in Section 4.2.2. For this query too, Exist, Sedna and Qizx/open
ran out of memory at 4 steps. The times and trends are about the same
as for query 4: Sedna’s query times grew more slowly than X-Hive’s, but
X-Hive was the only xml tool reaching 4 steps.
Table 7.4: Query 5. Pathway searches

Steps   Exist     X-Hive    Sedna    Qizx/open   C++ w. leaf del.   C++ w/o leaf del.
1       0.42 s    0.33 s    0.12 s   0.91 s      0.0003 s           0.0003 s
2       0.95 s    0.65 s    0.25 s   1.13 s      0.003 s            0.003 s
3       33.32 s   9.06 s    3.49 s   7.20 s      0.015 s            0.015 s
4       -         700.4 s   -        -           0.101 s            0.101 s
5       -         -         -        -           0.770 s            0.823 s
6       -         -         -        -           5.48 s             5.56 s
7       -         -         -        -           35.32 s            35.83 s
8       -         -         -        -           212.9 s            215.4 s
9       -         -         -        -           1 193 s            -

The C++ program developed for graph searches was orders of magnitude faster.
This is due to a number of things, the most important probably
being that the program is compiled C++ code instead of XQuery functions
interpreted at run-time by Java programs in a Java Virtual Machine (or, in
Sedna’s case, compiled C and Scheme code).
Other reasons are that the C++ program has loop detection, deletion of
redundant edges and a data structure more suitable for graphs and graph
searches with fewer and faster memory lookups. The deletion of leaves did
not affect the search times until searching for paths of more than 4 steps,
and there were no time savings until searches for 8 steps were performed
on the tested data. The deletion of leaves took about 340 ms; this
depends heavily on the topology of the graph.
The most important factors are probably the deletion of redundant
edges and the better suited data structure.
7.3
Premature technique
The XQuery grammars were not consistent from implementation to implementation. The most remarkable difference was between Qizx/open and
the other implementations. As part of query 4 for Reactome (see Algorithm
C.15 in the appendices) we wanted to return the reaction id placed in
an attribute when the current node was the species attribute within a
speciesReference element node. The wanted id attribute is placed on an
element called reaction that is the parent of listOfProducts, which in turn
is the parent of speciesReference.
In all XQuery tools except Qizx/open the path expression became
$i/parent::node()/parent::node()/parent::node()/@id
whereas to get the same result in Qizx/open we had to use the expression
$i/parent::node()/parent::node()/@id.
This is because Qizx/open had a different notion of where in the tree the
species attribute is placed, i.e. it placed the attribute at the same level
as the element it resides in, in contrast to the other tools that saw the
attribute as a child node to the element.
Differences like these made it hard to move queries from one XQuery
implementation to another.
Another difference was spotted with X-Hive, which did not execute a query
where if was directly followed by a parenthesis, i.e. if( versus if (.
In general, differences seemed to be present in the XQuery-specific parts
of the query tools, while XPath queries worked without trouble when
moving from one implementation to another. This was a consequence of
XQuery still being in a development phase. Using a test harness, such as
the commercial BumbleBee [CCHDV03], may be a way to detect some of
these irregularities.
Pathway searches in Sedna tended to crash irregularly without any clue
as to why. After some testing, the problems disappeared if the pathway search
query was placed before the other queries; this worked even when the set
of queries was iterated ten times. This is probably a bug that will be fixed
before the release version of the software.
Chapter 8
Discussion
This chapter contains conclusions from the thesis work as well as suggestions for further work within this area.
8.1
Conclusions
The aim of this thesis was to investigate whether it was possible to perform search
and discovery directly on xml files with protein interaction data using existing xml tools, to compare different xml tools and their applicability to
protein interaction data, and from this to build a working demonstration system.
Since we suspected that this would be very slow using file-based xml
tools, we decided to test different native xml database systems (nxds).
Other xml database types (such as systems based on relational database management systems) were not tested since they were out of scope for this
work. The native xml database systems use indices and data structures
that decrease the query times.
The testing showed that it was possible to write queries on the protein interaction data using the XQuery language. For a small dataset such as
the 3.1 MiB Reactome data set, all three tested nxds gave replies
within times acceptable for interactive use, with the exception of the pathway
queries.
For the 29.3 MiB IntAct data set, Sedna was significantly faster
than the other tested systems. Exist’s queries mostly completed in
under one second on the tested system, while X-Hive gave the most mixed
results. This could possibly be fixed by applying additional indices manually, but the
queries must be written in special ways to make use of the indices and we
did not manage to rewrite them in a way that increased performance.
To be able to process xml data in an nxd the data must be loaded into
the database, and since indices must be updated and the data structured
according to the storage model, this takes some time. Despite this, there is
a time saving in using an nxd compared to the file-based XQuery
implementation tested: with the IntAct data set the loading time was recouped
after just four queries with Sedna or X-Hive.
In the case of pathway queries, the search times become long after just
a few steps and the tools do not seem to cope with pathway queries longer
than three steps. A C++ program for performing this query was developed
using a graph package; it resulted in searches orders of magnitude faster than
the xml databases, making it possible to search even larger graphs.
The different XQuery implementations behaved differently for the same
code and some of them crashed under certain circumstances. They have
clearly not matured yet.
8.2
Future work
These are possible directions to continue from this thesis work as recognized
by the author.
8.2.1
More formats
The work for this thesis has been done with two formats, psi-mi and sbml.
Since there are a number of other formats for the same type of data
(as presented in Sections 3.5 and 3.6), a natural continuation would be to
test other formats.
8.2.2
Data integration
The work has been done separately in two different formats. An obvious
next step would be to integrate data from different sources. This could be
done in different ways: either by conversion between formats or by writing
queries that take the different structures into account. To determine whether a
protein described in two different formats is the same, external database
references can be used.
A conversion between formats may result in loss of data, since not all of
the data modeled in one format is modeled in the other. This
can be acceptable for some applications; another approach would be to
convert from the different formats to a super format that stores all types of
data. This format would not necessarily be entirely new; using an existing
format and its capabilities for extension (either in the format itself, such as
attributes in psi-mi, or by using xml namespaces) would suffice.
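To illustrate the reference-based matching idea, the following C++ sketch (C++ being the language already used for the graph program in this thesis) groups protein records from two exports under a shared external reference such as a UniProt accession. The record layout, field names and example identifiers are assumptions made for this illustration only; they are not taken from the psi-mi or sbml schemas.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Hypothetical record extracted from one of the two formats. Only the
// shared external reference (here called uniprotAcc) is essential.
struct Protein {
    std::string localId;     // id used inside one format
    std::string uniprotAcc;  // external database reference shared by both
    std::string source;      // e.g. "psi-mi" or "sbml"
};

// Group records from both sources under their shared accession, so a
// protein described in two formats ends up in one bucket.
std::map<std::string, std::vector<Protein>>
mergeByXref(const std::vector<Protein>& a, const std::vector<Protein>& b) {
    std::map<std::string, std::vector<Protein>> merged;
    for (const Protein& p : a) merged[p.uniprotAcc].push_back(p);
    for (const Protein& p : b) merged[p.uniprotAcc].push_back(p);
    return merged;
}
```

A bucket with more than one entry then identifies the same protein in both data sources.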
8.2.3 Data integration with OWL
[LF04] describes a way of using owl (see Section 2.4.2) to create a homogeneous data source from xml data sources, regardless of the structure of
the underlying sources. Without the owl-based mapping on top, two different queries would have to be formulated for two different data sources
describing the same data in two different ways. The xml Schema describing the xml document is mapped to owl: type definitions are mapped
to owl classes or data types, and declarations to owl properties.
Since the xml data model is a tree and the owl data model is a graph,
they use a modified variant of the XQuery language called swql (Semantic
Web Query Language). Instead of the XPath language it uses swqlPath to
select specific instances of an owl graph. The queries can then be expressed
as swql queries and translated back to XQuery fitting the different data
sources.
8.2.4 Using live remote data
The work in this thesis has been based on files already downloaded and
stored on local disk. This gives fast response times at the cost of occupied
disk space and the risk of outdated data.
To make real use of the web, the data sources could be queried directly,
e.g. using the caching strategy described in [CWR04] to overcome problems
with response latency. This strategy caches queries and their results
and can answer some subsequent queries, or parts of them, by reasoning on
their containment relationships. This would, however, only be useful if the
XQuery solution were distributed, so that the client communicates with the
remote server using XQuery instead of downloading all data and then
performing the queries on the now local file. At the moment there seem to
be no database providers offering remote queries using XQuery.
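The exact-match core of such a cache can be sketched as follows. This is a minimal illustration under the assumption that only identical repeated queries are answered locally; the containment reasoning that makes the strategy in [CWR04] more powerful is deliberately left out, and the fetch callback stands in for a remote XQuery round trip.

```cpp
#include <cassert>
#include <functional>
#include <map>
#include <string>

// Minimal query-result cache: remember each remote result keyed by the
// query text, so an identical repeated query never leaves the client.
class QueryCache {
public:
    explicit QueryCache(std::function<std::string(const std::string&)> fetch)
        : fetch_(std::move(fetch)) {}

    // Answer from the cache when possible, otherwise fetch and remember.
    std::string run(const std::string& query) {
        auto it = cache_.find(query);
        if (it != cache_.end()) return it->second;  // no remote round trip
        std::string result = fetch_(query);
        cache_.emplace(query, result);
        return result;
    }

private:
    std::function<std::string(const std::string&)> fetch_;
    std::map<std::string, std::string> cache_;
};
```

Extending this to answer a narrower query from a broader cached one is exactly the containment reasoning of the cited work.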
8.2.5 XQuery graph support
As far as we know there is no special support for graph processing in
XQuery applications today. Graphs can be used to express a large number
of different types of data, and there is no doubt that graph data is already
expressed in xml format today. xgmml (Extensible Graph Markup and Modeling
Language) [PK01] is a general format, based on gml (which was used in this
thesis work), for describing graphs using xml. Other more specialized formats,
such as the existing psi-mi and sbml formats for biological
data, may emerge in a number of different areas.
So there is a need for graph-capable xml tools, especially in combination
with XQuery to filter results. As shown in this thesis, it is
possible to make simple graph applications using just ordinary XQuery
code. This is slow, however, and will only work for small graphs or for a
small number of steps.
To be able to perform larger graph searches there must be special support
for this, either built into the tool or through the data binding capabilities
found in the tools. Qizx/open [Fra05a] and Exist [Exib] are among those
offering Java binding, making it possible to call Java methods in the same
way as XQuery's internal functions.
A graph-support package for xml could contain common search algorithms
such as depth-first search and breadth-first search, as well as
an algorithm for finding all paths, as described in this thesis.
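The all-paths routine such a package could offer might look like the following C++ sketch, which mirrors the depth-limited search used in the pathway queries of this thesis. The adjacency-list Graph type is an assumption; a real package would build it from xgmml or application-specific elements.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Directed graph as adjacency lists keyed by node id (an assumption
// made for the sketch; node ids would come from the xml data).
using Graph = std::map<std::string, std::vector<std::string>>;

// Depth-limited depth-first search collecting every path to the goal.
static void dfs(const Graph& g, const std::string& node,
                const std::string& goal, int depthLeft,
                std::vector<std::string>& path,
                std::vector<std::vector<std::string>>& found) {
    path.push_back(node);
    if (node == goal) {
        found.push_back(path);          // complete path reached the goal
    } else if (depthLeft > 0) {
        auto it = g.find(node);
        if (it != g.end())
            for (const std::string& next : it->second)
                dfs(g, next, goal, depthLeft - 1, path, found);
    }
    path.pop_back();                    // backtrack
}

// All paths from start to goal of at most maxSteps edges.
std::vector<std::vector<std::string>>
allPaths(const Graph& g, const std::string& start,
         const std::string& goal, int maxSteps) {
    std::vector<std::string> path;
    std::vector<std::vector<std::string>> found;
    dfs(g, start, goal, maxSteps, path, found);
    return found;
}
```

The depth limit plays the same role as the $n parameter of the findMolecule function in the XQuery listings.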
One problem with such a package is that, unless a standard graph
markup language such as xgmml is used, it must be told which elements
and/or attributes to consider as nodes and edges.
8.2.6 User interface
The programs developed as part of this thesis work all use a command
line interface. If an application for biologists were to be developed, there
would almost certainly be a need for a traditional graphical user interface
or a web-based interface. Such an interface would probably consist of a way of
entering searches for proteins (such as an id number or a descriptive name),
and the output in its simplest form would be a textual representation
of the pathways, linked to further information on each protein passed. A
more advanced output would be a graphical representation of the resulting
pathways in graph form.
Bibliography
[AG]
Software AG. Tamino developer community - W3C XQuery
information [online, cited 21st October 2004]. Available
from: http://developer.softwareag.com/tamino/quip/
default.htm.
[asn04]
ASN.1 information site [online]. July 2004 [cited 7th October
2004]. Available from: http://asn1.elibel.tm.fr/en/.
[BBH03]
Gary D. Bader, Doron Betel, and Christopher W. V. Hogue.
BIND: the biomolecular interaction network database. Nucleic Acids Research, 31(1):248–250, 2003.
[bin]
BIND - the biomolecular interaction network [online, cited
7th October 2004]. Available from: http://bind.ca/.
[BL01]
Tim Berners-Lee. Notation 3 - ideas about web architecture
[online]. November 2001 [cited 23rd November 2004]. Available from: http://www.w3.org/DesignIssues/Notation3.
[Bot03]
Per Bothner. Qexo: The GNU Kawa implementation of
XQuery [online]. June 2003 [cited 21st October 2004]. Available from: http://www.gnu.org/software/qexo/.
[Bou04]
Ronald Bourret. XML and databases [online]. July 2004 [cited
5th October 2004]. Available from: http://www.rpbourret.
com/xml/XMLAndDatabases.htm.
[CCHDV03] Inc. Clarkware Consulting and LLC Hunter Digital Ventures.
Bumblebee: The xquery test harness [online]. 2003 [cited
16th March 2005]. Available from: http://www.xquery.com/
bumblebee/.
[Cla01]
James Clark.
Relax NG DTD compatibility [online].
December 2001 [cited 5th April 2005]. Available from:
http://www.oasis-open.org/committees/relax-ng/
compatibility-20011203.html.
[Cla03]
James Clark. Relax NG home page [online]. September
2003 [cited 21st March 2005]. Available from: http://www.
relaxng.org/.
[Cok05]
Mike Cokus. XML binary characterization use cases [online].
March 2005 [cited 5th April 2005]. Available from: http:
//www.w3.org/TR/xbc-use-cases/.
[Con04]
International Human Genome Sequencing Consortium. Finishing the euchromatic sequence of the human genome. Nature, 431:931–945, October 2004.
[CRZ03]
Akmal B. Chaudhri, Awais Rashid, and Roberto Zicari, editors. XML Data Management. Addison-Wesley, 2003.
[Cun01]
Barbara A. Cunningham. Finding a mate. The Scientist,
15(7), April 2001.
Available from: http://www.tcd.
ie/Genetics/staff/Noel.Murphy/recombinant%20dna%
20ge4021/two%20hybrid%20systems.doc [cited 18th March
2005].
[CWR04]
Li Chen, Song Wang, and Elke A. Rundensteiner. Replacement strategies for XQuery caching systems. Data Knowl.
Eng., 49(2):145–175, 2004.
[Dar04]
Fraunhofer IPSI Darmstadt. IPSI-XQ - The XQuery demonstrator [online]. February 2004 [cited 21st October 2004].
Available from: http://www.ipsi.fraunhofer.de/oasys/
projects/ipsi-xq/index_e.html.
[dbx03]
dbXML - native XML database [online]. 2003 [cited May 12th
2005]. Available from: http://www.dbxml.com/.
[DGvHW03] Yves Deville, David Gilbert, Jacques van Helden, and
Shoshana J. Wodak. An overview of data models for the
analysis of biochemical pathways. Brief Bioinform, pages
246–259, September 2003.
[dip04]
Database of interacting proteins [online]. 2004 [cited 8th April
2005]. Available from: http://dip.doe-mbi.ucla.edu/.
[Dum04]
Edd Dumbill. Notes and xqueries [online]. October 2004
[cited 16th November 2004]. Available from: http://www.
xml.com/pub/a/2004/10/20/deviant.html.
[EMA+ 04]
Andrew Eisenberg, Jim Melton, Michael Abbott, Per Bothner, Jason Hunter, and Brian Maso. JSR 225: XQuery API
for Java (XQJ) [online]. July 2004 [cited 6th October 2004].
Available from: http://www.jcp.org/en/jsr/detail?id=
225.
[EMB05]
EMBL-EBI. IntAct Annotation Rules (January 2005), January 2005.
[ent]
Entrez PubMed [online, cited 8th April 2005]. Available from:
http://www.ncbi.nlm.nih.gov/entrez/query.fcgi.
[exia]
Configuring indexes [online, cited 21st March 2005]. Available
from: http://exist-db.org/indexing.html.
[Exib]
Open source native xml database - xquery [online, cited 21st
April 2005]. Available from: http://exist-db.org/xquery.
html.
[exi05]
exist - open source native XML database [online]. 2005
[cited 12th May 2005]. Available from: http://exist.
sourceforge.net/.
[FFRB03]
Michael Forster, Michael Forster, Marcus Raitner, and Christian Bachmaier. The graph template library [online]. 2003
[cited 12th May 2005]. Available from: http://infosun.
fmi.uni-passau.de/GTL/.
[FH03]
Andrew Finney and Michael Hucka. Systems biology markup
language (SBML) level 2: Structures and facilities for
model definitions [online]. June 2003 [cited 7th October
2003]. Available from: http://sbml.org/specifications/
sbml-level-2/version-1/html/.
[Fit02]
Michael Fitzgerald. Relax NG’s compact syntax [online]. June
2002 [cited 16th November 2004]. Available from: http://
www.xml.com/pub/a/2002/06/19/rng-compact.html.
[Fra05a]
Xavier Franc. Qizx programmer’s guide [online]. March 2005
[cited 21st April 2005]. Available from: http://www.xfra.
net/qizxopen/devguide.html.
[Fra05b]
Xavier Franc. Qizx/open [online]. March 2005 [cited
12th May 2005]. Available from: http://www.xfra.net/
qizxopen/root.html.
[gal04]
Galax: An implementation of XQuery [online]. September
2004 [cited 21st October 2004]. Available from: http://www.
galaxquery.org/.
[GFK04]
Maxim Grinev, Andrey Fomichev, and Sergey Kuznetsov.
Sedna: A native xml dbms. 2004. Available from: http:
//www.ispras.ru/~grinev/mypapers/sedna.pdf.
[Gro04]
XQuark Group. The XQuark project: open source information integration components based on XML and XQuery [online]. September 2004 [cited 21st October 2004]. Available
from: http://xquark.objectweb.org/.
[HFSB03]
Michael Hucka, Andrew Finney, Herbert Sauro, and Hamid
Bolouri. Systems biology markup language (SBML) level
1: structures and facilities for basic model definitions [online]. August 2003 [cited 7th October 2004]. Available
from: http://sbml.org/specifications/sbml-level-1/
version-2/html/sbml-level-1.html.
[Him96]
Michael Himsolt. Sample graph in GML format [online].
April 1996 [cited 21st March 2005].
Available from:
http://infosun.fmi.uni-passau.de/Graphlet/GML/
example1.html.
[HM04]
Jason Hunter and Brett McLaughlin. JDOM v 1.0 [online].
2004 [cited 15th November 2004]. Available from: http://
www.jdom.org/docs/apidocs/.
[HMPB+ 04] H. Hermjakob, L. Montecchi-Palazzi, G. Bader, J. Wojcik,
L. Salwinski, A. Ceol, S. Moore, S. Orchard, U. Sarkans,
C. von Mering, B. Roechert, S. Poux, E. Jung, H. Mersch,
P. Kersey, M. Lappe, Y. Li, R. Zeng, D. Rana, M. Nikolski,
H. Husi, C. Brun, and K. Shanker. The HUPO PSI’s molecular interaction format–a community standard for the representation of protein interaction data. Nature Biotechnology,
22(2):177–83, 2004.
[Hol01]
Steven Holzner. Inside XML. New Riders, 2001.
[HUP]
Proteomics Standards Initiative HUPO.
Molecular interaction XML format documentation version 1.0 [online,
cited 5th October 2004]. Available from: http://psidev.
sourceforge.net/mi/xml/doc/user/.
[inf05]
Infonyte DB [online]. April 2005 [cited 12th May 2005]. Available from: http://www.infonyte.com/en/products.html.
[int05]
IntAct [online]. April 2005 [cited 8th April 2005]. Available
from: http://www.ebi.ac.uk/intact/index.jsp.
[ipe]
Ipedo XML database [online, cited 12th May 2005]. Available
from: http://www.ipedo.com/html/ipedo_xml_database.
html.
[ISO86]
ISO. 8879:1986 standard general markup language (SGML),
1986.
[JAKC+ 02] H. V. Jagadish, S. Al-Khalifa, A. Chapman, L. V. S. Lakshmanan, A. Nierman, S. Paparizos, J. M. Patel, D. Srivastava,
N. Wiwatwattana, Y. Wu, and C. Yu. Timber: A native XML
database. The VLDB Journal, 11(4):274–291, 2002.
[jni04]
JNI - Java native interface [online]. 2004 [cited 8th March
2005]. Available from: http://java.sun.com/j2se/1.5.0/
docs/guide/jni/.
[Kat04]
Howard Katz. XQEngine [online]. February 2004 [cited 21st
October 2004]. Available from: http://www.fatdog.com/.
[Kay03]
Michael H. Kay. XML five years on: a review of the achievements so far and the challenges ahead. In Proceedings of the
2003 ACM symposium on Document engineering, pages 29–
31. ACM Press, 2003.
[KG00]
Minoru Kanehisa and Susumu Goto. KEGG: Kyoto encyclopedia of genes and genomes. Nucleic Acids Research,
28(1):27–30, 2000.
[Kim01]
Andrew Kimball. Command line transformations using
msxsl.exe [online]. September 2001 [cited 16th March 2005].
Available from:
http://msdn.microsoft.com/library/
default.asp?url=/library/en-us/dnxml/html/msxsl.
asp.
[Kra]
Curtis Krauskopf. Windows API timer with microsecond
(or better) accuracy [online, cited 9th March 2005]. Available from: http://www.decompile.com/cpp/faq/windows_
timer_api.htm.
[Lab]
Kanehisa Laboratory. KEGG markup language manual [online, cited 5th October 2004]. Available from: http://www.
genome.jp/kegg/docs/xml/.
[Leh01]
Patrick Lehti. Design and implementation of a data manipulation processor for an XML query language. Master’s thesis,
Technische Universität Darmstadt, August 2001. Available
from: http://www.lehti.de/beruf/diplomarbeit.pdf.
[Les02]
Arthur M. Lesk. Introduction to Bioinformatics. Oxford University Press, 2002.
[LF04]
Patrick Lehti and Peter Fankhauser. XML data integration
with OWL: Experiences & challenges. In 2004 Symposium on
Applications and the Internet (SAINT 2004), 26-30 January
2004, Tokyo, Japan, pages 160–167. IEEE Computer Society,
2004. Available from: http://csdl.computer.org/comp/
proceedings/saint/2004/2068/00/20680160abs.htm.
[LLC05]
Xpriori LLC. Xpriori technology faq [online]. 2005 [cited
21st March 2005]. Available from: http://www.xpriori.
com/html/technology_faq1.html.
[LM00]
Andreas Laux and Lars Martin. XML:DB initiative: XUpdate - XML update language [online]. September 2000 [cited
6th October 2004]. Available from: http://xmldb-org.
sourceforge.net/xupdate/xupdate-wd.html.
[Man01]
Kevin T. Manley.
Time for Java [online].
August 2001 [cited 8th March 2005].
Available from:
http://www.fawcette.com/archives/premier/mgznarch/
javapro/2001/08aug01/km0108/km0108-2.asp.
[mar05]
Marklogic content interaction server [online]. 2005 [cited 12th
May 2005]. Available from: http://www.marklogic.com/
prod.html.
[MBK+ 00]
Didier Martin, Mark Birbeck, Michael Kay, Brian Loesgen,
Stephen Mohr, Jonathan Pinnock, Steven Livingstone, Peter
Stark, Kevin Williams, Richard Anderson, Nikola Ozu, Bruce
Peat, and David Baliles. Professional XML. Wrox Press,
2000.
[Mic04]
Microsoft.
Queryperformancecounter function [online].
2004 [cited 9th March 2005].
Available
from:
http://msdn.microsoft.com/library/
en-us/winui/winui/windowsuserinterface/
windowing/timers/timerreference/timerfunctions/
queryperformancecounter.asp.
[min05]
Mint database [online]. March 2005 [cited 8th April 2005].
Available from: http://mint.bio.uniroma2.it/mint/.
[MOD04]
MODIS ISPRAS. Sedna Programmer’s Guide, 2004. Available from: http://www.modis.ispras.ru/Development/
Documentation/Sedna/ProgGuide.pdf.
[NCB03]
XML at NCBI [online]. May 2003 [cited 7th October
2004]. Available from: http://www.ncbi.nlm.nih.gov/IEB/
ToolBox/XML/.
[neo05]
Neocore XMS [online]. 2005 [cited 12th May 2005]. Available
from: http://www.xpriori.com/html/products.html.
[oC]
University of California. XIN fomat [sic!] [online, cited 5th
October 2004]. Available from: http://dip.doe-mbi.ucla.
edu/dip/Guide.cgi?SM=0:3.
[PK01]
John Punin and Mukkai Krishnamoorthy. Xgmml (extensible
graph markup and modeling language) [online]. August 2001
[cited 21st April 2005]. Available from: http://www.cs.rpi.
edu/~puninj/XGMML/.
[Pro02a]
Will Provost. Normalizing xml, part 1 [online]. November
2002 [cited 15th November 2004]. Available from: http://
www.xml.com/lpt/a/2002/11/13/normalizing.html.
[Pro02b]
Will Provost. Normalizing xml, part 2 [online]. December
2002 [cited 15th November 2004]. Available from: http://
www.xml.com/lpt/a/2002/12/04/normalizing.html.
[rea]
Reactome - a knowledgebase of biological processes [online, cited 18th March 2005]. Available from: http://www.
reactome.org/.
[Rel01]
Relax ng specification [online]. December 2001 [cited 16th
November 2004]. Available from: http://relaxng.org/
spec-20011203.html.
[rel02]
Relax NG compact syntax [online]. November 2002 [cited
16th November 2004]. Available from: http://relaxng.org/
compact-20021121.html.
[Rou02]
Vladimir Roubtsov. Profiling CPU usage from within a Java
application [online]. November 2002 [cited 8th March 2005].
Available from: http://www.javaworld.com/javaworld/
javaqa/2002-11/01-qa-1108-cpu.html.
[Sam04]
Samuel. Intact - statistics [online]. September 2004 [cited
18th October 2004]. Available from: http://www.ebi.ac.
uk/intact/doc/html/statistics/index.html.
[SAX]
About SAX [online, cited 5th October 2004]. Available from:
http://www.saxproject.org/.
[sed05]
Sedna - native XML DBMS [online]. 2005 [cited 12th
May 2005]. Available from: http://www.modis.ispras.ru/
Development/sedna.htm.
[Smi01]
Michael Smith.
Relax NG and Docbook [online].
December 2001 [cited 5th April 2005].
Available
from: http://sources.redhat.com/ml/docbook/2001-12/
msg00060.html.
[son05]
Sonic XML server [online].
2005 [cited 12th May
2005]. Available from: http://www.sonicsoftware.com/
products/sonic_xml_server/index.ssp.
[Sri]
Rahul Srivastava. XML Schema: Understanding namespaces
[online, cited 16th November 2004].
Available from:
http://www.oracle.com/technology/pub/articles/
srivastava_namespaces.html.
[Sta01]
Kimbro (editor) Staken. XML database API draft [online]. September 2001 [cited 6th October 2004]. Available from:
http://xmldb-org.sourceforge.net/xapi/
xapi-draft.html.
[Str03]
Lena Strömbäck. XML representations of pathway data: a
comparison. July 2003.
[swi05]
Expasy - swiss-prot and trembl [online]. March 2005 [cited
6th April 2005]. Available from: http://www.expasy.org/
sprot/.
[SWK+ 01]
Albrecht Schmidt, Florian Waas, Martin Kersten, Daniela
Florescu, Michael J. Carey, Ioana Manolescu, and Ralph
Busse. Why and how to benchmark XML databases. SIGMOD Rec., 30(3):27–32, 2001.
[tam04]
Tamino overview [online]. 2004 [cited 12th May 2005].
Available from: http://www1.softwareag.com/corporate/
products/tamino/default.asp.
[tim04]
Timber [online]. July 2004 [cited 12th May 2005]. Available
from: http://www.eecs.umich.edu/db/timber/.
[Twy03]
Richard Twyman. The Wellcome trust: Protein interaction
mapping [online]. January 2003 [cited 18th March 2005].
Available from: http://www.wellcome.ac.uk/en/genome/
thegenome/hg03f001.html.
[vdV01]
Eric van de Vlist. Comparing XML schema languages
[online].
December 2001 [cited 16th November 2004].
Available from: http://www.xml.com/pub/a/2001/12/12/
schemacompare.html.
[Ven03]
Bill Venners. What’s wrong with XML APIs [online]. May
2003 [cited 15th November 2004]. Available from: http://
www.artima.com/intv/xmlapisP.html.
[vir04]
Openlink universal integration middleware - virtuoso product
family [online]. 2004 [cited 12th May 2005]. Available from:
http://virtuoso.openlinksw.com/.
[W3C95]
W3C.
Hypertext markup language - 2.0, November
1995. Available from: http://www.rfc-editor.org/rfc/
rfc1866.txt [cited 5th October 2004].
[W3C98]
W3C. Extensible markup language (XML) 1.0, February 1998. Available from: http://www.w3.org/TR/1998/
REC-xml-19980210 [cited 5th October 2004].
[W3C99a]
W3C. Namespaces in XML [online]. January 1999 [cited
16th November 2004]. Available from: http://www.w3.org/
TR/1999/REC-xml-names-19990114/.
[W3C99b]
W3C. XML path language (XPath) version 1.0 [online].
November 1999 [cited 7th October 2004]. Available from:
http://www.w3.org/TR/1999/REC-xpath-19991116.
[W3C99c]
W3C. XSL transformations (XSLT) version 1.0 [online].
November 1999 [cited 7th October 2004]. Available from:
http://www.w3.org/TR/1999/REC-xslt-19991116.
[W3C01a]
W3C. XML linking language (XLink) version 1.0 [online].
June 2001 [cited 7th October 2001]. Available from: http:
//www.w3.org/TR/2001/REC-xlink-20010627/.
[W3C01b]
W3C.
XML schema part 0:
Primer, May 2001.
Available
from:
http://www.w3.org/TR/2001/
REC-xmlschema-0-20010502/.
[W3C03]
W3C. XML syntax for XQuery 1.0 (XQueryX) - W3C
working draft [online]. December 2003 [cited 7th October 2004]. Available from: http://www.w3.org/TR/2003/
WD-xqueryx-20031219/.
[W3C04a]
W3C. Document object model (DOM) level 3 core specification, April 2004. Available from: http://www.w3.org/TR/
2004/REC-DOM-Level-3-Core-20040407/.
[W3C04b]
W3C. The extensible stylesheet language family (XSL) [online]. July 2004 [cited 7th October 2004]. Available from:
http://www.w3.org/Style/XSL/.
[W3C04c]
W3C.
OWL web ontology language overview, February 2004. Available from: http://www.w3.org/TR/2004/
REC-owl-features-20040210/.
[W3C04d]
W3C. RDF vocabulary description language 1.0: RDF
schema, February 2004. Available from: http://www.w3.
org/TR/2004/REC-rdf-schema-20040210/.
[W3C04e]
W3C. RDF/XML syntax specification (revised), February 2004. Available from: http://www.w3.org/TR/2004/
REC-rdf-syntax-grammar-20040210/ [cited 5th October
2004].
[W3C04f]
W3C. XQuery 1.0: An XML query language - W3C
working draft [online].
July 2004 [cited 7th October
2004].
Available from: http://www.w3.org/TR/2004/
WD-xquery-20040723/.
[Wor04]
BioPAX Workgroup. BioPAX - biological pathways exchange
language level 1, version 1.0 documentation, July 2004. Available from: http://www.biopax.org/Downloads/Level1v1.
0/biopax-level1.zip.
[X-H04]
X-Hive Corporation B.V. X-Hive/DB 6.1 - Manual, 2004.
[xhi05]
X-Hive/DB [online]. 2005 [cited 12th May 2005]. Available
from: http://www.x-hive.com/products/db/index.html.
[xin]
Apache Xindice [online].
apache.org/xindice/.
[xml05]
XML binary characterization working group public page [online]. March 2005 [cited 8th April 2005]. Available from:
http://www.w3.org/XML/Binary/.
Available from:
http://xml.
[YzK04]
Benjamin Bin Yao, M. Tamer Özsu, and Nitin Khandelwal.
Xbench benchmark and performance testing of XML DBMSs.
In Proceedings of 20th International conference on Data Engineering. Boston, MA., pages 621–632, March 2004.
Index
+, 7, 32, 45, 52
Exist, 60
Sedna, 60
X-Hive, 61
computational models, 40
amino acid, 37
API
Data-binding, 18
Pull, 17
Push, 17
Query, 18
Tree-based, 18
ASN.1, 47
data binding, 84
data integration, 83
Data-binding API, 18
data-centric documents, 9
Database models, 40
DIP, 41
DNA, 37
document-centric documents, 9
DTD, 10
bait, 39, 45
benchmarking
C++ program, 73
XML databases, 66
BIND, 42, 47
bioinformatics, 37
databases, 40
exchange formats, 47
standard formats, 43
Summary, 48
BioPAX, 46
Exist, 60
false positives, 40
FLWOR, 22
gene, 37
GML (Graph Markup Language), 56, 70
grammar, 6
graph-based models, 40
GTL (Graph Template Library), 56, 69
C++, 56
caching, 75, 77, 84
collection, 30
HTML (Hypertext Markup Language), 5
HUPO, 45, 46
indexing
Exist, 60
Sedna, 60
X-Hive, 61
indices, 30
IntAct, 42
Java, 55, 61
JNI (Java Native Interface), 66
KEGG, 41
KGML, 47
Links
XLink, 17
metadata, 15
MINT, 42
namespace, 14
Native XML databases, 27
definition, 28
normalization, 31
NXD implementations, 33
dbXML, 33
Exist, 34
Infonyte DB, 34
Ipedo XML database, 34
MarkLogic Content Interaction Server, 34
Neocore XMS, 34
Sedna, 34
Sonic XML Server, 34
Tamino, 34
Timber, 34
Virtuoso, 35
X-Hive DB, 35
Xindice, 35
NXD models
Model-based, 33
Persistent DOM, 33
Text-based, 33
OWL (Web Ontology Language), 16, 83
pathway, 38
metabolic, 38
regulatory, 38
signal transduction, 38
performance, 32
phage-display systems, 39
prey, 39, 45
protein, 38
protein interactions, 38
experimental methods, 39
PSI MI, 45
Pull API, 17
Push API, 17
Qizx/open, 61
queries, 53
Query API, 18
RDF (Resource Description Framework), 16
Reactome, 42
recursion, 63
referential integrity, 32
relational database, 27
Relax NG, 12
results
IntAct data, 75
Reactome data, 77
round-tripping, 30, 65
SBML, 43
Sedna, 34, 60
semi-structured, 9, 29
XML serialization, 65, 76
SGML (Standard Generalized Markup Language), 5
SWQL (Semantic Web Query Language), 83
Tamino, 25, 28, 34
test framework, 65
Timber, 32, 34
transformation, 19, 70
from SBML to GML, 69
XSLT, 19
Tree-based API, 18
two-hybrid systems, 39
user interface, 85
validation, 6, 10
DTD, 10
Relax NG, 12
XML Schema, 11
well-formed, 6
X-Hive, 61
X-Hive DB, 35
XGMML, 84
XIN, 47
XML, 5
advantages, 7
binary, 48
drawbacks, 8
Summary, 35
XML database, 26
XML-enabled databases, 27
XML databases
Hybrid XML databases, 28
Native XML databases, 27
XML Schema, 11
XML:DB API, 32
XPath, 19
XQJ (XQuery API for Java), 32
XQuery, 21
XQuery compliance, 79
XQuery implementations, 25
Galax, 26
IPSI-XQ, 25
Qexo, 26
Qizx/open, 26
QuiP, 25
Saxon, 26
XQEngine, 25
XQJ, 26
XQuark, 25
XQueryX, 25
XUpdate, 21
Appendix A
System specifications
A.1 Software
• Exist Native xml Database v. 1.0 beta 1
• gtl v. 1.2.1
• Java J2SE build 1.5.0-b64
• Microsoft Visual Studio .net 2003
• Microsoft Windows xp Professional, version 2002, Service Pack 2
• Microsoft xslt Processor version 4.0
• Qizx/open 0.4p2
• Sedna pre-release version v. 0.3 dated 2005-01-21
• X-Hive db 6.1
A.2 Hardware
IBM X40 with Intel Pentium M processor 1200 MHz and 512 MiB ram.
Appendix B
Figures
Figure B.1: sbml structure
Figure B.2: psi mi structure
Figure B.3: kgml structure
Appendix C
File listings
C.1 Transformation
Listing C.1: sbml to gml transformation
<?xml version="1.0"?>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:sbml="http://www.sbml.org/sbml/level2"
    xmlns:html="http://www.w3.org/1999/xhtml">

<xsl:output method="text" encoding="iso-8859-1" indent="no"/>

<xsl:template match="/">graph [
directed 1
<xsl:apply-templates select="//sbml:species"/>
<xsl:apply-templates select="//sbml:reaction"/>
]
</xsl:template>

<xsl:template match="sbml:species">node [
id <xsl:value-of select="substring-before(substring-after(@id,'_'),'_')" />
label &quot;<xsl:value-of select="@id"/>&quot;
]
</xsl:template>

<xsl:template match="//sbml:reaction">
  <xsl:for-each select="sbml:listOfReactants/sbml:speciesReference">
    <xsl:variable name="sourceid">
      <xsl:value-of select="substring-before(substring-after(@species,'_'),'_')" />
    </xsl:variable>
    <xsl:for-each select="../../sbml:listOfProducts/sbml:speciesReference">
edge [
source <xsl:value-of select="$sourceid"/>
target <xsl:value-of select="substring-before(substring-after(@species,'_'),'_')" />
]
    </xsl:for-each>
  </xsl:for-each>
</xsl:template>
</xsl:stylesheet>
C.2 IntAct XQueries
Listing C.2: IntAct query 1 - Protein info
(: Query 1 - IntAct:)
(: Find information on a given protein. Protein id is given. :)
declare namespace p = "net:sf:psidev:mi";
document("/psimi/intact/rat_small.xml")//p:proteinInteractor[@id="EBI-77471"]
Listing C.3: IntAct query 2 - Experiment info
(: Query 2 - IntAct:)
(: Find information on a given experiment description. :)
(: Experiment description id is given. :)
declare namespace p = "net:sf:psidev:mi";
document("/psimi/intact/rat_small.xml")//p:experimentDescription[@id="EBI-77457"]
Listing C.4: IntAct query 3 - Interaction info
(: Query 3a - IntAct:)
(: Find information on a given interaction. Shortlabel is given. :)
declare namespace p = "net:sf:psidev:mi";
document("/psimi/intact/rat_small.xml")//p:interaction[p:names/p:shortLabel="dlgap1-dlg1-1"]
Listing C.5: IntAct query 4 - Protein interaction info
(: Query 4a - IntAct:)
(: Find the protein information for the proteins that are part :)
(: of a given interaction. Interaction id is given. :)
declare namespace p = "net:sf:psidev:mi";
for $ref in document("/psimi/intact/rat_small.xml")
    //p:interaction[p:names/p:shortLabel="dlgap1-dlg1-1"]
    /p:participantList/p:proteinParticipant/p:proteinInteractorRef/@ref
return document("/psimi/intact/rat_small.xml")//p:proteinInteractor[@id=$ref]
Listing C.6: IntAct query 5 - Experiment interaction info
(: Query 5 - IntAct :)
(: Find the experiment description information for an :)
(: experiment description that is part of an interaction. :)
(: Interaction id is given. :)
declare namespace p = "net:sf:psidev:mi";
document("/psimi/intact/rat_small.xml")//p:experimentDescription[
    @id=document("/psimi/intact/rat_small.xml")//p:interaction[
        p:names/p:shortLabel="chp-slc9a1"]/p:experimentList/p:experimentRef/@ref]
Listing C.7: IntAct query 6 - Interactions for protein
(: Query 6 - IntAct:)
(: Find all interactions that a given protein is part of. :)
(: Protein id is given. :)
declare namespace p = "net:sf:psidev:mi";
document("/psimi/intact/rat_small.xml")//p:interaction[
    p:participantList/p:proteinParticipant/p:proteinInteractorRef/@ref="EBI-77480"]
Listing C.8: IntAct query 7 - Interactions for experiment
(: Query 7 - IntAct:)
(: Find all interactions that a given experiment description is :)
(: part of. Experiment description id is given. :)
declare namespace p = "net:sf:psidev:mi";
document("/exjobb/dbexs/psimi/intact/rat_small.xml")//p:interaction[
    p:experimentList/p:experimentRef/@ref="EBI-77457"]
Listing C.9: IntAct query 8 - Interactions for proteins
(: Query 8 - IntAct:)
(: Find any interactions that two given proteins are a part of. :)
(: Protein ids for the two proteins are given. :)
declare namespace p = "net:sf:psidev:mi";
document("/psimi/intact/rat_small.xml")//p:interaction[
    p:participantList/p:proteinParticipant/p:proteinInteractorRef/@ref="EBI-77460"
    and p:participantList/p:proteinParticipant/p:proteinInteractorRef/@ref="EBI-77480"]
Listing C.10: IntAct query 9 - Count proteins
(: Query 9 - IntAct:)
(: Count number of proteins in document.:)
declare namespace p = "net:sf:psidev:mi";
count(document("/psimi/intact/rat_small.xml")//p:proteinInteractor)
Listing C.11: IntAct query 10 - Count interactions
(: Query 10 - IntAct:)
(: Count number of interactions in document.:)
declare namespace p = "net:sf:psidev:mi";
count(document("/psimi/intact/rat_small.xml")//p:interaction)
C.3 Reactome XQueries
Listing C.12: Reactome query 1 - Compartment info
(: Query 1 - Reactome:)
(: Find all information on a given compartment. :)
(: Compartment id is given. :)
declare namespace s = "http://www.sbml.org/sbml/level2";
document("/sbml/reactome/sbml.xml")//s:compartment[@id="R_cytosol"]
Listing C.13: Reactome query 2 - Species info
(: Query 2 - Reactome:)
(: Find all information on a given species. :)
(: Species id is given. :)
declare namespace s = "http://www.sbml.org/sbml/level2";
document("/sbml/reactome/sbml.xml")//s:species[@id="R_109266_5_nucleotidase_ecto_CD73_holoenzyme"]
Listing C.14: Reactome query 3 - Reaction info
(: Query 3 - Reactome:)
(: Find all information on a given reaction. :)
(: Reaction id is given. :)
declare namespace s = "http://www.sbml.org/sbml/level2";
document("/sbml/reactome/sbml.xml")//s:reaction[@id="R_109278"]
Listing C.15: Reactome query 4 - Outgoing way
(: Query 4 - Reactome:)
(: Given a reaction id, find all information on :)
(: those reactions that have a protein listed as reactant that :)
(: is also listed as a product in the given reaction. :)

declare namespace s = "http://www.sbml.org/sbml/level2";

declare function local:findMolecule($molecule as xs:string, $n as xs:integer) {
  for $i in document("/sbml/reactome/sbml.xml")//s:reaction[
      s:listOfReactants/s:speciesReference/@species = $molecule]
      /s:listOfProducts/s:speciesReference/@species
  return <item reaction="{$i/parent::node()/parent::node()/parent::node()/@id}">{$i} {
      if ($n = 1)
      then <max/>
      else local:findMolecule($i, $n - 1)}
    </item>
};

<path>{local:findMolecule("R_76178_Crotonoyl_ACP",4)}</path>
Listing C.16: Reactome query 5 - Pathway search
(: Query 5 - Reactome:)
(: Given two proteins find out if there is a given pathway :)
(: between them. Protein ids are given for the two proteins.
:)
declare namespace s = "http://www.sbml.org/sbml/level2";
declare function local:findMolecule($molecule as xs:string, $goalMol as xs:string, $n as xs:integer) {
for $i in document("/sbml/reactome/sbml.xml")//s:reaction[s:listOfReactants/s:speciesReference/@species = $molecule]/s:listOfProducts/s:speciesReference/@species
return <item>{$i} {
if($i = $goalMol)
then <found/>
else if($i = $molecule)
then <loop />
else if ($n = 1)
then <max/>
else local:findMolecule($i, $goalMol, $n - 1)}
</item>
};
<path>{local:findMolecule("R_109276_H2O","R_83910_sodium_ion",2)}</path>
Listing C.17: Reactome query 5 - Pathway search (alternative)
declare namespace s = "http://www.sbml.org/sbml/level2";
declare function local:findMolecule($molecule as xs:string, $goalMol as xs:string, $n as xs:integer) {
for $i in document("/sbml/reactome/sbml.xml")//s:reaction[s:listOfReactants/s:speciesReference/@species = $molecule]/s:listOfProducts/s:speciesReference/@species
return <item reaction="{$i/parent::node()/parent::node()/parent::node()/@id}">{$i} {
if($i = $goalMol)
then <found/>
else if($i = $molecule)
then <loop />
else if ($n = 1)
then <max/>
else local:findMolecule($i, $goalMol, $n - 1)}
</item>
};
declare function local:getParent($itema as node()) {
if($itema[@species])
then
<e>{local:getParent($itema/parent::node())}<node>{$itema/@species}{$itema/@reaction}</node></e>
else <top/>
};
<pathdoc>
{
if(count(document("/sbml/reactome/sbml.xml")//s:species[@id="R_109276_H2O"]) and count(document("/sbml/reactome/sbml.xml")//s:species[@id="R_83910_sodium_ion"]))
then <pex>{
let $t := local:findMolecule("R_109276_H2O","R_83910_sodium_ion",2)
return <result>
<pathsfound>{count($t//found)}</pathsfound>
<paths>
{
for $b in $t//item[found]
return
<path>{
local:getParent($b)
}
</path>
}
</paths>
</result>
}</pex>
else <notfound/>
}
</pathdoc>
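For readers less familiar with XQuery, the depth-limited recursive search that queries 4 and 5 implement can be sketched in Python. This is an illustrative re-expression only, not part of the thesis queries: the reaction table and species names below are hypothetical stand-ins for the SBML data, and each reaction is reduced to a (reactants, products) pair.

```python
def find_molecule(reactions, molecule, goal, depth):
    """Return lists of species forming paths from `molecule` to `goal`,
    following reactant -> product edges through at most `depth` reactions.
    Mirrors local:findMolecule in Listings C.16/C.17."""
    paths = []
    for reactants, products in reactions:
        if molecule not in reactants:
            continue
        for product in products:
            if product == goal:
                paths.append([product])      # goal reached: the <found/> case
            elif product == molecule or depth == 1:
                continue                     # the <loop/> and <max/> cases
            else:
                # recurse one step deeper, as the XQuery function does
                for tail in find_molecule(reactions, product, goal, depth - 1):
                    paths.append([product] + tail)
    return paths

# Hypothetical two-step pathway: A -> B -> C
reactions = [({"A"}, {"B"}), ({"B"}, {"C"})]
print(find_molecule(reactions, "A", "C", 2))  # [['B', 'C']]
```

As in the XQuery version, the depth bound is what keeps the recursion finite on a cyclic reaction network; without it the search could revisit species indefinitely.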
Listing C.18: Reactome query 6 - Count species
(: Query 6 - Reactome:)
(: Count the number of species in the database. :)
declare namespace s = "http://www.sbml.org/sbml/level2";
count(document("/sbml/reactome/sbml.xml")//s:species)
Listing C.19: Reactome query 7 - Count reactions
(: Query 7 - Reactome:)
(: Count the number of reactions in the database. :)
declare namespace s = "http://www.sbml.org/sbml/level2";
count(document("/sbml/reactome/sbml.xml")//s:reaction)
LINKÖPING UNIVERSITY
ELECTRONIC PRESS
Copyright
The publishers will keep this document online on the Internet - or its possible replacement - for a period of 25 years from the date of publication barring exceptional circumstances.

The online availability of the document implies a permanent permission for anyone to read, to download, to print out single copies for your own use and to use it unchanged for any non-commercial research and educational purpose. Subsequent transfers of copyright cannot revoke this permission. All other uses of the document are conditional on the consent of the copyright owner. The publisher has taken technical and administrative measures to assure authenticity, security and accessibility.

According to intellectual property law the author has the right to be mentioned when his/her work is accessed as described above and to be protected against infringement.

For additional information about the Linköping University Electronic Press and its procedures for publication and for assurance of document integrity, please refer to its WWW home page: http://www.ep.liu.se/
© David Hall
Linköping, 9th June 2005