Chris Wallace, University of the West of England, Bristol
chris.wallace@uwe.ac.uk
This paper will examine the potential of Native XML databases (NXD) in the construction of information systems, and explore the design space created by this technology.
The author's current project is the development of a University Faculty information system to provide information on the faculty's teaching programme, staffing, organisational structure and the administrative processes which constrain and support academic life.
This paper sets out the background to the project, an overview of the technology chosen and then explores a number of design issues which arise when trying to map problem to solution. The task of mapping the conceptual model into the constructs provided by XML and the chosen NXD, and the choice of supporting tools are both explored.
Progress on the project so far confirms the choice of an NXD for this application, with benefits in speed of development and direct representation of user structures.
The Faculty of Computing, Engineering and Mathematical Sciences (CEMS) at the University of the West of England (UWE) is one of ten faculties in the University. In total, UWE has around 27,000 students and 2,700 staff. Within CEMS alone, we have 300 staff and 3,000 students. The faculty offers 100 or so programmes of study, teaches nearly 500 individual modules, sets 1,000 units of assessment (an exam or assignment) and assesses, at a guess, 30,000 pieces of student work per annum.
Within the University, central IT services provide systems for personnel, finance, student records, time-tabling, e-learning and other corporate functions. However, support for the academic processes within a faculty, such as provision of information on the teaching programme and support for the many administrative processes (for example managing the production, moderation, printing and marking of an exam paper), is currently devolved to the individual faculties. Some are more computerised than others. We are paying the penalty for being early in the development cycle: we now have a collection of rather old, isolated silos of information, based on various text, LDAP and MySQL data stores processed with PHP into HTML and PDF. Update was controlled through a webmaster, except for small amounts of staff data. As a consequence, many administrative units have developed their own information systems based on a variety of tools, predominantly Microsoft Access and Excel.
This project is thus essentially a rework of an existing requirement in response to identified shortcomings. These include: lack of integration within and between the datasets and with UWE systems; centralised control of data; lack of support for faculty processes; replicated data; and poor data quality and usage. A faculty project was initiated to address these and other needs, and the outcome is a dual development using XML for managed, structured data, called FOLD (Faculty OnLine Data), and an open-source Content Management System (Plone) for unstructured documents and ad-hoc collaboration.
Although the current scope of the system includes a score of datasets, information on modules and programmes is core to the system. A Module is the smallest unit in which credit is awarded to a student, varying in size from the equivalent of 1/12 of a year's work to a full year. Students enrol on a Programme of study which determines the award they will receive on completion and defines the core and optional modules to be taken at each stage.
Traditionally, this has been a document-centred system. The definitive specifications for both modules and programmes are Word documents, with a standardized structure for most textual data with diagrams for more complex relationships. Examples can be seen on the website for this workshop (see below).
In this project we aim to retain the document-centric perspective for ownership and as units for editing, whilst supporting a data-centric view for processing and for linking between these documents and the data required to describe the rest of the Faculty. Figure 1 shows a rough data model for the Module and Programme sub-system.
Some non-functional properties of this system are noteworthy. Frequency of access is quite low, and updates are infrequent. For example, new versions of modules and programmes are restricted (by contractual relationships with the student) to once a year, and only a small proportion will change at all. The format of module specifications and the coding systems change as frequently as the data itself.
As a simple information system, the main processing required is to enter, update, query and retrieve the data in various forms. However the long-term intention is to grow both the scope of the data model and support for process management and workflow. Little work has yet been done in this area beyond process support through form generation linked to process descriptions, but since process modelling languages are typically XML-based, an XML repository provides scope for combining data and process descriptions. Another area of interest is visualisation to help managers understand the gross behaviour of this complex system.

Figure 1: Modules and Programmes
It would be false to present this case as a fully reasoned progression from problem to solution. The truth is that the author had a hunch that XML and a native XML database could be the basis of a different kind of information system. An initial pilot study showed promise, and past experience with a pre-XML tagged system using Perl and flat files was motivation enough to want to do it properly in XML.
With this decision made, the task is to discover how a new (at least to the author) technology can be exploited to solve the problem. 'Solve' gives little of the flavour of the project. Donald Schon's phrase, 'design as a conversation with the materials of the situation' [4], far better captures the process of concurrent learning about problem and solution.
The development in the last few years of native XML databases (NXD) has been very rapid. Ronald Bourret [1] lists over 40 NXDs, including Berkeley DB XML from Sleepycat, Tamino from Software AG, Xindice from Apache and eXist, an open-source project. An NXD allows any XML document to be stored, retrieved, updated and queried as an XML document, without shredding it into multiple, linked tables or holding it merely as text in a BLOB, as is the case with an XML-enabled relational database such as Oracle. Unlike an XML-enabled RDBMS, an NXD does not need a definition of the structure, but stores each document in a tree structure which balances the needs of reconstruction and querying. The greater knowledge of element types and locations which a schema provides would allow faster indexing, but this benefit may be outweighed by the difficulty of schema definition for some document types, the problem of handling heterogeneous collections of documents, and schema evolution. One penalty of working without a schema is that casting strings to datatypes must be explicit.
A key component of an NXD is support for a query language designed for XML. This language is XQuery, a functional language which is built on XPath and supports the querying and construction of XML documents. Its functionality overlaps with that of XSLT, a declarative language for transforming XML documents into XML and other textual formats.
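The point about explicit casting can be illustrated with a small, self-contained sketch. The sample data is invented, and the element names are assumptions which follow the query fragments used later in this paper. In standard XQuery, untyped values are compared as strings in an order by clause, so without the cast the modules would be sorted as text rather than by number of credits.

```xquery
(: Sample data invented for illustration; not taken from the FOLD database. :)
let $modules :=
  <modules>
    <ModuleSpecification><ModuleCode>M1</ModuleCode><credits>15</credits></ModuleSpecification>
    <ModuleSpecification><ModuleCode>M2</ModuleCode><credits>120</credits></ModuleSpecification>
    <ModuleSpecification><ModuleCode>M3</ModuleCode><credits>30</credits></ModuleSpecification>
  </modules>
for $m in $modules/ModuleSpecification
(: With no schema the credits are untyped, so the cast is needed
   to sort numerically (120, 30, 15) rather than as strings. :)
order by xs:integer($m/credits) descending
return string($m/ModuleCode)
```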
The eXist NXD is an open source development of a project by Wolfgang Meier [2]. The core development team are mainly European with Meier as the lead architect. Versions have been in use for several years and development work is very active. The software has been essentially stable and in use on a number of projects worldwide.
eXist is coded in Java and can be run either embedded in a Java application or stand-alone in a number of configurations. Database management can be carried out either through a web interface or via a Java client which provides a small but useful development environment, allowing XQuery scripts to be edited or executed in situ.
eXist provides a hierarchical filestore in which documents (files) are held in collections (folders). XML documents, including XSLT, are parsed, loaded into a B+ tree and indexed, whilst binary resources including XQuery, CSS style sheets and images are stored unparsed. Queries are written in XQuery, and are just-in-time compiled and cached. Typically in XQuery, node sequences are selected in the database by XPath expression, either by default over the whole database or by restricting the selection to a document or a collection. The XQuery function collection() accesses a named directory and all its sub-directories. eXist also provides a variant, xcollection(), which restricts the dataset to the named directory only. External data sources identified by a URL, perhaps dynamically generated, are also accessible.
XQuery has been extended in the eXist implementation, both by somewhat controversial changes to the XQuery language itself and through function libraries which provide support for free-text matching and a simpler syntax than XUpdate for in situ updates. Additional module libraries and database functionality provide support for browser interaction and sessions, user and database management, transactions and triggers. The result is a powerful server-side language which combines the role of a conventional scripting language such as JSP or PHP with that of a database query language. So far this project has been developed entirely in XQuery and XSLT.
The overall design space is shaped by the nature of the problem and the available technology. This paper attempts to explore some of this space through a series of design issues or decision points. Many issues are still open and are little more than a placeholder to prompt discussion. Many are common to all information system development but raise slightly different issues in the context of this problem/solution pair.
We start with the central problem of how to map a conceptual model, such as a class model, into the constructs afforded by an NXD in general and eXist in particular, then look at some of the support systems needed to couple the database to the users.
Figure 1 shows a partial conceptual class model of the module and programme sub-system. Mapping the conceptual model into a number of XML schemas, and into the database directory structure presents a number of choices. The following sections discuss these decisions.
The central task is to map the conceptual model into one or more XML documents. The structure of each is described in a schema, typically written in XML Schema. Composite relationships are central to the identification of candidate documents, and Figure 1 shows a large number of them. In principle, one could start at the root of a composition hierarchy and create a composite entity type by taking the closure of these composition relationships. Applied mechanically to the module hierarchy, this would lead to a module document containing all versions. A better match to current practice is to place the root of this schema at Module Specification, yielding a ModuleSpecification document. The conceptual model alone seems to provide insufficient information to guide the mapping to Documents, but it is not clear what other factors influence this decision.
The directory structure of eXist provides another composition mechanism. It is not entirely clear how this structure should be used. Possibilities include partitioning by document type for ease of management, by ownership to support access rights, by version to avoid document name complexity, and so on. Although directory design is a problem which faces us every day on company or personal file stores, it is still a non-trivial task to decide on the organisation of this tree structure, since its structure is essentially arbitrary.
Moreover, documents may be structured to be containers for multiple instances of a more basic Entity type, aggregated for some administrative purpose. It is thus useful to think of the identification of an Entity Type as a semantically meaningful unit rather than the containing document. This is the terminology which will be used in the remainder of the paper.
Weaker relationships must be mapped into references in one Entity to another. Whereas in a Relational Model, this would require the identifier (primary key) from the one side to be added to the many-side (as a foreign key), in an XML database as in OO, the reference can be held either side, as a single or multi-valued element, depending on which side is most salient. This decision can be indicated by the visibility annotation, but it requires judgment to decide. One indication is the presence of additional structure to the references (simple sequencing to the complex expressions of pre-requisites). XML allows such structures to be directly represented.
Whilst navigation within a composite entity is by path expression, navigation between associated entities is by means of a join expression. The generalised equality operator of XPath 2.0 on which XQuery is based is essential here. Thus to locate the set of programme codes that have modules with credits = 30, the XPath expression for this join is:
distinct-values(
  //ProgrammeStructure
    [*//moduleCode = (//ModuleSpecification[credits = 30]/ModuleCode)]
  /programmeCode
)
The equivalent in SQL is left as an exercise to the reader.
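The join expression can be tried without a database by binding it to a small in-line document. The following is a minimal sketch: the sample data, the wrapper element db and the intermediate stage and core elements are invented for illustration, while the other element names follow the expression above.

```xquery
(: Sample data invented for illustration. :)
let $db :=
  <db>
    <ModuleSpecification><ModuleCode>M1</ModuleCode><credits>30</credits></ModuleSpecification>
    <ModuleSpecification><ModuleCode>M2</ModuleCode><credits>15</credits></ModuleSpecification>
    <ProgrammeStructure>
      <programmeCode>P1</programmeCode>
      <stage><core><moduleCode>M1</moduleCode><moduleCode>M2</moduleCode></core></stage>
    </ProgrammeStructure>
    <ProgrammeStructure>
      <programmeCode>P2</programmeCode>
      <stage><core><moduleCode>M2</moduleCode></core></stage>
    </ProgrammeStructure>
  </db>
(: Codes of the modules worth 30 credits. :)
let $wanted := $db//ModuleSpecification[credits = 30]/ModuleCode
(: The general comparison "=" is true if any moduleCode in a programme
   equals any of the wanted codes, which is what makes it a join. :)
return distinct-values($db//ProgrammeStructure[*//moduleCode = $wanted]/programmeCode)
```

With this data only programme P1 contains a 30-credit module, so the result is the single value P1.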
Identifiers may be based on names in the problem domain or be system-generated surrogates, as in OO. In defiance of the normal warnings about the use of naturally occurring names as identifiers, we have opted to use natural names where they exist, partly on the grounds of reducing the impedance between model and real world. In this information system the faculty can be regarded as a natural single 'namespace'. If there is a collision of names, for example two members of staff with the same name, then this is a problem within the community which would or should be resolved in the real world, since it leads to confusion in natural communication (in the minutes of meetings, in a phone list and so on) just as it would in the database. Typically this resolution is by agreeing to change one name, or to use a variant. When a new member of staff joins, the system should identify the clash but allow the two members to co-exist. The software can still distinguish the separate entities as distinct nodes in the database, but any access should show the two instances and allow the user to decide which was intended. If and when the pilot system is expanded to the university, names qualified by faculty will be needed. Clearly URNs have a role here and will be explored in a future development cycle.
Another aspect is whether an entity can have multiple identifiers. In real life, people do have multiple names - women change names on marriage, so that documents prior to marriage use their maiden name whilst those after marriage may use their married name. The person schema allows multiple names for a person, so that references from existing and new documents will still be valid.
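A minimal sketch of how lookup by natural name might behave is given below. The person structure, the sample staff and the lookup element are all invented for illustration and are not the project's actual person schema. A person matches if any one of their names equals the name sought, and a clash is reported rather than rejected.

```xquery
(: Sample data invented for illustration; element names are assumptions. :)
let $staff :=
  <staff>
    <person><name>Ann Smith</name><name>Ann Jones</name><room>2Q10</room></person>
    <person><name>Bob Brown</name><room>3P5</room></person>
    <person><name>Bob Brown</name><room>1N2</room></person>
  </staff>
for $wanted in ("Ann Jones", "Bob Brown")
(: "=" compares the name sought against every name a person holds. :)
let $matches := $staff/person[name = $wanted]
return
  <lookup>
    <name>{ $wanted }</name>
    <ambiguous>{ count($matches) gt 1 }</ambiguous>
    { $matches }
  </lookup>
```

Here the first lookup finds one person through her second name, whilst the second returns both instances so that the user can decide which was intended.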
XML, in allowing for two kinds of data representation, presents the designer with a difficult task, for which there seems to be little agreement in the literature. Guidelines which suggest attributes are meta-data to elements seem weak. Writers argue forcefully on either side.
In this project, attributes are not used. This is partly because database size was not judged to be an issue for this application and partly because editing attributes in Word, Excel and InfoPath, three editors which are used on the project, is more difficult than editing elements. Arguably it is also easier to read a nested element structure than an unordered string of attributes, and element-based schemas are more flexible. However, attributes are unordered whereas elements are typically ordered, which over-constrains many Entity schemas.
Although XML databases are object-like in their use of composite data structures, there is no corresponding class system in which to hold instance or class methods. This is especially true of course in a schema-free database such as eXist. However this does not prevent the developer from writing the XQuery code in modules which reflect the entity types, with the restriction that entities must be passed explicitly and inheritance is not supported. Modules must be associated with a namespace, so that conveniently the scope of a function name is the module and hence the entity type.
We distinguish type inheritance, which is inheritance in the schema domain, from value inheritance (also called property inheritance, cascading or defaulting), which is inheritance in the instance domain. Type inheritance is supported within a single XML schema, but whilst useful for factoring a schema, is of little use in a schema-free database. Inheritance of values of properties is however an important technique in data normalisation which is common in information models and supported directly in prototype-based object languages such as JavaScript.
In FOLD, administrative roles for a programme are optionally defined at the level of programme groups (all the undergraduate courses in the information systems school, for example). At the level of a programme within a group, these roles may be defined or overridden. Finally, default values are held at the Faculty level.
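As a sketch of this style, the functions for one entity type can be grouped under a namespace of their own. The namespace URI, the prefix ms, the function names and the sample module below are all hypothetical. To keep the example runnable as a single query the functions are declared in the query prolog; in practice they would more likely live in a separate library module, beginning with a module namespace declaration and brought in with import module.

```xquery
(: The namespace URI, prefix and function names are hypothetical. :)
declare namespace ms = "http://example.org/fold/moduleSpecification";

(: Functions for the ModuleSpecification entity type.
   The entity is always passed explicitly as a parameter. :)
declare function ms:label($m as element(ModuleSpecification)) as xs:string {
  concat($m/ModuleCode, " ", $m/title)
};

declare function ms:is-large($m as element(ModuleSpecification)) as xs:boolean {
  xs:integer($m/credits) ge 30
};

(: Sample data invented for illustration. :)
let $m :=
  <ModuleSpecification>
    <ModuleCode>M1</ModuleCode><title>Databases</title><credits>30</credits>
  </ModuleSpecification>
return (ms:label($m), ms:is-large($m))
```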

Figure 2: Programme Management
Simplistically, hard-coded relationships can be used in procedural code. In this example, default values are defined in system parameters read from a configuration file:
(: $code, the programme code sought, is supplied by the caller :)
let $prog := //programme[programmeCode = $code],
    $faculty := //group[name = $prog/Faculty],
    $progGroup := //programmeGroup[name = $prog/groupName],
    $leader :=
      if ($prog/leader) then $prog/leader
      else if ($progGroup/leader) then $progGroup/leader
      else $faculty/academicDirector
return $leader
SQL addresses this problem through the use of the coalesce function which here would be something like:
coalesce($prog/leader, $progGroup/leader, $faculty/academicDirector)
In XQuery, this can be expressed directly as:
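let $leader := ($prog/leader, $progGroup/leader, $faculty/academicDirector)[1]
which selects the first node in the combined sequence. A level at which no leader is defined contributes an empty sequence and so simply drops out, which gives the same effect as skipping nulls in SQL.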
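The sequence idiom can be seen at work on a small in-line dataset. This is a sketch only: the sample data, the wrapper element db and the effectiveLeader result element are invented for illustration, and the remaining element names follow the fragments above.

```xquery
(: Sample data invented for illustration. :)
let $db :=
  <db>
    <group><name>CEMS</name><academicDirector>Director A</academicDirector></group>
    <programmeGroup><name>IS Undergraduate</name><leader>Leader G</leader></programmeGroup>
    <programmeGroup><name>IS Postgraduate</name></programmeGroup>
    <programme>
      <programmeCode>P1</programmeCode><Faculty>CEMS</Faculty>
      <groupName>IS Undergraduate</groupName><leader>Leader P</leader>
    </programme>
    <programme>
      <programmeCode>P2</programmeCode><Faculty>CEMS</Faculty>
      <groupName>IS Undergraduate</groupName>
    </programme>
    <programme>
      <programmeCode>P3</programmeCode><Faculty>CEMS</Faculty>
      <groupName>IS Postgraduate</groupName>
    </programme>
  </db>
for $prog in $db/programme
let $faculty := $db/group[name = $prog/Faculty],
    $progGroup := $db/programmeGroup[name = $prog/groupName]
return
  <effectiveLeader>
    <programmeCode>{ string($prog/programmeCode) }</programmeCode>
    <!-- The most specific level that defines a leader wins. -->
    <leader>{ string(($prog/leader, $progGroup/leader, $faculty/academicDirector)[1]) }</leader>
  </effectiveLeader>
```

P1 keeps its own leader, P2 inherits Leader G from its programme group, and P3 falls back to the faculty's Director A.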
This raises the possibility of a generalised representation of the inheritance paths in a meta-model, but this would call for the dynamic construction of code and use of eval() with consequential performance and robustness problems. An intermediate approach would package the code into a module, which although not bound directly to the data as in a class-based system, does at least factor the code.
Views are an important concept in relational databases, and no less important in an XML database. Value inheritance is one example of the need for view construction. Another arises when an object must be augmented with derived data, either before passing down the process flow to XSLT, or for querying. For example, in displaying a module specification, codes need to be expanded to code + title. Value inheritance results in an object of the same type as the original, allowing downstream processes such as XSLT to be re-used. The same approach is possible with augmentation. It seems easier to use a wrapper around the original object together with the augmented data but, especially with a permissive schema, adding the augmented data in place would be more consistent.
Views can either be transient or cached into the database. eXist supports triggers which can be used to refresh cached views and the choice between the approaches depends on the balance of update frequency, access frequency and the cost of materialisation.
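The wrapper style of augmentation described above might look like the following sketch, which builds a transient view. The sample data and the ProgrammeView, derived and module element names are invented for illustration. The original programme document is copied unchanged and the expanded code + title pairs are placed alongside it.

```xquery
(: Sample data invented for illustration. :)
let $db :=
  <db>
    <ModuleSpecification><ModuleCode>M1</ModuleCode><title>Databases</title></ModuleSpecification>
    <ModuleSpecification><ModuleCode>M2</ModuleCode><title>Web Programming</title></ModuleSpecification>
    <ProgrammeStructure>
      <programmeCode>P1</programmeCode>
      <stage><core><moduleCode>M1</moduleCode><moduleCode>M2</moduleCode></core></stage>
    </ProgrammeStructure>
  </db>
for $prog in $db/ProgrammeStructure
return
  <ProgrammeView>
    { $prog }
    <derived>
    {
      (: Expand each module code into code + title. :)
      for $code in distinct-values($prog//moduleCode)
      return
        <module>
          <moduleCode>{ $code }</moduleCode>
          <title>{ string($db/ModuleSpecification[ModuleCode = $code]/title) }</title>
        </module>
    }
    </derived>
  </ProgrammeView>
```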
A two-level ontology like XML Schema creates the problem of deciding which domain concepts should be represented in the schema and which as data. Most decisions lie at clear ends of this spectrum, but there are always hard cases on the boundary.
A case in point is the representation of 'roles'. In the example on value inheritance above, the several roles in a programme are explicit as elements in the structure. There are problems with this approach. Roles are a generic concept, associating a person with a group with added properties of when the role commenced its tenure etc. Clearly this data can be added to each element as attributes or as sub-elements and the separate roles defined as sub-types in the XML Schema, but XQuery, even when the schema is available, does not use this type hierarchy in query compilation. Secondly, new role kinds appear as the model develops in richness, so the schema would need frequent updating. Finally, the role types themselves need describing as first-class entities.
The resultant generalised model is shown in Figure 3.

Figure 3: Organisation Model
This generality comes at some expense, though less in an XML database than in a relational database, where this additional level would require an additional table and join. In XML, there is no performance penalty, but validation cannot be done with an XML schema, and explicit validation queries are required against this meta-data. Ideally there should also be a transparent mapping between the generic view of the data and a view in which elements are named.
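Both the validation against meta-data and the mapping from the generic view to a named view can be sketched in a few lines of XQuery. The role and roleKind structures below are invented for illustration and are only a guess at how the generalised model might be represented. A computed element constructor turns each role kind into an element name, and roles whose kind is not described as a first-class entity are reported.

```xquery
(: Sample data and element names invented for illustration. :)
let $db :=
  <db>
    <roleKind><name>programmeLeader</name><description>Leads a programme</description></roleKind>
    <roleKind><name>admissionsTutor</name><description>Handles admissions</description></roleKind>
    <role><kind>programmeLeader</kind><person>Ann Smith</person><group>P1</group></role>
    <role><kind>admissionsTutor</kind><person>Bob Brown</person><group>P1</group></role>
    <role><kind>examsOfficer</kind><person>Cy Green</person><group>P1</group></role>
  </db>
let $known := $db/roleKind/name
return
  <report>
    <namedView>
      <programmeCode>P1</programmeCode>
      {
        (: Generic role records become elements named after the role kind. :)
        for $r in $db/role[group = "P1"][kind = $known]
        return element { string($r/kind) } { string($r/person) }
      }
    </namedView>
    <undefinedKinds>
      {
        (: Validation against the meta-data: kinds with no roleKind entry. :)
        for $k in distinct-values($db/role[not(kind = $known)]/kind)
        return <kind>{ $k }</kind>
      }
    </undefinedKinds>
  </report>
```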
Although much emphasis is placed on the role of the schema for document validation, in a database this validation is insufficient. A schema cannot express inter-field or inter-document constraints, cannot ensure primary key uniqueness across collections or foreign key integrity across a database of documents, and cannot check complex business constraints which may require reference to external systems. One of the main benefits of a relational database is the strength of its integrity checking, and this is missing in an NXD. In some contexts the lack of integrity checking is a major obstacle; in others it is less clearly so. In FOLD, many of the integrity checks are business rules outside the scope of XML Schema or RDBMS constraints. Data conversion from an existing document system, supplemented by hand coding of new datasets, meant that integrity constraints would have had to be turned off in any case, since the data was full of errors and inconsistencies. Getting the data in any form under database management was more important than only allowing consistent data in. Indeed it is likely that the database will never be completely consistent, but humans cope with this - it is much harder to cope if data is excluded altogether because it fails validation.
Instead, validation queries have been developed to be run on demand. At present the results are merely provided as reports to the user, but the intention is to materialise the errors, refreshing them when documents are updated and using the error database as feedback on the quality of the dataset. This validation will include explicit schema validation, which is supported in eXist. Such an approach is partly inspired by xlinkit [3].
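A validation query of this kind might look like the sketch below, which checks key uniqueness and referential integrity over a small in-line dataset. It is not one of the project's own queries: the sample data and the errors, duplicateKey and danglingReference elements are invented for illustration.

```xquery
(: Sample data invented for illustration; it contains two deliberate errors. :)
let $db :=
  <db>
    <ModuleSpecification><ModuleCode>M1</ModuleCode><title>Databases</title></ModuleSpecification>
    <ModuleSpecification><ModuleCode>M1</ModuleCode><title>Data Bases</title></ModuleSpecification>
    <ModuleSpecification><ModuleCode>M2</ModuleCode><title>Web Programming</title></ModuleSpecification>
    <ProgrammeStructure>
      <programmeCode>P1</programmeCode>
      <stage><core><moduleCode>M1</moduleCode><moduleCode>M9</moduleCode></core></stage>
    </ProgrammeStructure>
  </db>
let $codes := $db/ModuleSpecification/ModuleCode
return
  <errors>
    {
      (: Primary key uniqueness: a module code used by more than one specification. :)
      for $c in distinct-values($codes)
      where count($codes[. = $c]) gt 1
      return <duplicateKey>{ $c }</duplicateKey>
    }
    {
      (: Foreign key integrity: a programme refers to a module with no specification. :)
      for $p in $db/ProgrammeStructure,
          $ref in distinct-values($p//moduleCode)
      where not($ref = $codes)
      return
        <danglingReference>
          <programmeCode>{ string($p/programmeCode) }</programmeCode>
          <moduleCode>{ $ref }</moduleCode>
        </danglingReference>
    }
  </errors>
```

With this data the report contains a duplicate key for M1 and a dangling reference from P1 to M9. The errors element is the kind of result that could be materialised and refreshed on update.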
We distinguish between a restrictive schema, like a white list - everything not allowed is forbidden - and a permissive schema, like a black list - everything not forbidden is allowed. Most well-known are restrictive schemas, such as DTD, XML Schema or RELAX NG, which define a grammar for the document. Less well known are permissive schemas like Schematron, which are based on pattern-matching.
Restrictive schemas can of course be more or less restrictive - an element may be a simple string in one schema, an enumeration in a more restrictive schema.
Permissive schemas may be useful in handling the 'fuzzy system boundary' problem. The focus for this system is the CEMS faculty as a sub-system of the University, internally strongly cohesive and weakly coupled to other faculties and to central administration. Thus the boundary of the system-of-interest is fuzzy. As an example, the simple Programme-Module model is complicated by the fact that whilst the bulk of the modules on a programme are taught within the Faculty, some are imported from other Faculties. Similarly, our Modules are used on Programmes run by other Faculties and our staff teach on modules run by other Faculties. This leads to the need for graduated models of these entities, in which the extensive data held about our own programmes and modules, required for control of faculty activities, degrades with distance from our faculty whilst retaining a core of basic data. For modules owned by the Faculty, module documents must conform to the full schema so that the full specification can be generated. For modules outside our control but inside our sphere of interest, a weaker schema is required which ensures that basic data is available. Moreover, a simple query is unlikely to care what order the code and title are in, whilst a restrictive schema would typically insist on this ordering.
A tougher nut to crack is that of the veracity of the data, i.e. the degree of agreement between the data in the database and the situation being modelled in the real world. A high level of validation can give the impression of a high level of veracity, but the two are not necessarily related. Indeed, tight validation often causes low veracity since it may force the entry of some value, indeed any value, which gets past validation. A case in point is the agreement between what the specification says should be taught, and how, and what is being taught in practice.
One approach to increasing veracity requires additional meta-data, such as the date on which documents were edited, and knowledge of the expected half-life of data veracity. This data allows the age of the data to be calculated, estimates to be made of its likely veracity, and processes to be scheduled to review data in a timely manner. In the absence of this meta-data and these meta-processes, users have no feedback on data quality and, over time, quality degrades.
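A graduated, permissive check in the pattern-matching spirit of Schematron can be approximated directly in XQuery. The sketch below is an illustration only: the owner element, the particular rules and the sample modules are assumptions and not the project's actual schema. Every module must carry the core data, in any order, whilst the fuller rule applies only to modules owned by the faculty.

```xquery
(: Sample data and rules invented for illustration. :)
let $modules :=
  <modules>
    <ModuleSpecification>
      <title>Databases</title><ModuleCode>M1</ModuleCode>
      <owner>CEMS</owner><credits>30</credits>
    </ModuleSpecification>
    <ModuleSpecification>
      <ModuleCode>X7</ModuleCode><owner>Other Faculty</owner>
    </ModuleSpecification>
  </modules>
for $m in $modules/ModuleSpecification
let $problems := (
    (: Core data required of every module, whatever the element order. :)
    if (empty($m/ModuleCode)) then "ModuleCode missing" else (),
    if (empty($m/title)) then "title missing" else (),
    (: The fuller rule applies only to modules owned by the faculty. :)
    if ($m/owner = "CEMS" and empty($m/credits)) then "credits missing" else ()
  )
where exists($problems)
return
  <failed>
    { $m/ModuleCode, for $p in $problems return <problem>{ $p }</problem> }
  </failed>
```

The first module passes even though its title precedes its code. The imported module X7 fails only the core rule for a title and is not asked for credits.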
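The age calculation can be sketched as follows. The lastEdited element, the one-year half-life, the fixed review date and the sample documents are all assumptions made for illustration; a live query would use current-date() in place of the fixed date. Processors which predate the final XQuery 1.0 Recommendation may expect a different prefix for the duration type.

```xquery
(: Sample data, the half-life and the fixed review date are assumptions. :)
let $today := xs:date("2006-02-13"),
    $halfLife := xs:dayTimeDuration("P365D"),
    $docs :=
      <docs>
        <ModuleSpecification><ModuleCode>M1</ModuleCode><lastEdited>2005-11-01</lastEdited></ModuleSpecification>
        <ModuleSpecification><ModuleCode>M2</ModuleCode><lastEdited>2003-06-15</lastEdited></ModuleSpecification>
      </docs>
for $d in $docs/ModuleSpecification
(: Subtracting two dates gives a duration; the cast is explicit as the data is untyped. :)
let $age := $today - xs:date($d/lastEdited)
where $age gt $halfLife
order by $age descending
return
  <reviewDue>
    { $d/ModuleCode }
    <ageInDays>{ $age div xs:dayTimeDuration("P1D") }</ageInDays>
  </reviewDue>
```

Only M2, last edited more than a year before the review date, is listed for review.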
Here we are concerned with filling the gap between the XML repository and the world of the users.
The project uses the following architecture for a typical interaction between the user and the database:
(HTML) client -> (Jetty -> eXist -> XQuery -> XSLT) server -> (CSS -> HTML) client
This is currently for internal use only, with minimal access control. We have to support external access via a reverse proxy on Apache and this may present some difficulties, particularly for session management.
Allocation of functionality between XQuery and XSLT has been a particular issue. The functionality of these two languages overlaps considerably, though their language styles are very different. At one extreme, XQuery could be used to generate the whole interface; at the other, it would be used solely to construct intermediate derived documents, with all presentation done with XSLT. Devising good, reusable intermediate schemas and becoming fluent in both languages has been a learning process which should not be underestimated.
eXist can also be deployed with Cocoon, but so far there has been no need for this additional layer of complexity. This architecture performs surprisingly well, although performance benefits from careful query design and index creation.
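The first of these extremes, in which XQuery generates the presentation directly, is sketched below for a simple module list. The sample data and page layout are invented for illustration. At the other extreme the query would return only an intermediate document, such as a list of module elements, and leave the table markup to an XSLT stylesheet.

```xquery
(: Sample data invented for illustration. :)
let $modules :=
  <modules>
    <ModuleSpecification><ModuleCode>M2</ModuleCode><title>Web Programming</title><credits>15</credits></ModuleSpecification>
    <ModuleSpecification><ModuleCode>M1</ModuleCode><title>Databases</title><credits>30</credits></ModuleSpecification>
  </modules>
return
  <html>
    <head><title>Modules</title></head>
    <body>
      <table>
        <tr><th>Code</th><th>Title</th><th>Credits</th></tr>
        {
          (: One table row per module, sorted by module code. :)
          for $m in $modules/ModuleSpecification
          order by string($m/ModuleCode)
          return
            <tr>
              <td>{ string($m/ModuleCode) }</td>
              <td>{ string($m/title) }</td>
              <td>{ string($m/credits) }</td>
            </tr>
        }
      </table>
    </body>
  </html>
```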
For editing structured documents, the developer has a wide range of choices, from custom schema-driven forms generated in Microsoft InfoPath, through generic editors such as XMLSpy or Oxygen, to interfaces developed directly in XQuery. There are also commercial XML suites such as XMetaL which, but for the cost, would be ideal. The faculty would however prefer open-source solutions, and XForms is also on the candidate list.
Office Excel 2003 provides good support for importing and exporting plain XML files. Currently, it appears that only table-structured XML can be edited directly with Excel 2003. Microsoft provides a plug-in for additional XML support, which includes the conversion of an area of a spreadsheet to XML. This approach is being used for those datasets which are currently held in spreadsheets. Using Excel 2003, these sheets can be simply dropped into the file system with no conversion required, providing a very direct connection between a powerful and familiar user interface and the repository.
Word 2003 can also edit XML files directly. The full XML capabilities of both Word and Excel have yet to be fully explored, but it is anticipated that increasing XML capability will be provided in time. InfoPath is also under evaluation for customised entry forms.
One of the weaknesses of the present information system is that updates are centralised. A major goal of the project was to distribute responsibility for update to the relevant parties in the faculty. This leads to the need to support long transactions, since the edit time for a big document is at least of the order of minutes.
One approach being investigated is the WebDAV standard to provide a basic interface to the XML repository and Microsoft Office 2003 to provide basic XML editors. Database updates are here restricted to extracting, changing and updating whole documents.
In this architecture, a user sets up a Web Folder which points to the XML repository. From Word, Excel or InfoPath, the user can then browse this directory structure, open an XML document, update it and replace it or save to a document or collection. WebDAV provides basic support for long transactions, supporting document check-in and check-out with persistent locks.
The development approach has been that of incremental delivery, simultaneously exploring the capabilities of the chosen technology and of the needs of the users. For an information system, as opposed to a software tool, value to the users lies in access to quality data more than to clever functionality. Users will cope with difficult-to-use interfaces if they get the answers they need more rapidly or more accurately than previously.
This is a challenge for the development team, who are co-developing the system at the same time as the live data. Inevitably this means that the means to restructure live data in the light of schema changes is needed from an early stage of the project. In fact schema change is quite straightforward with XSLT or XQuery.
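As an illustration of such a restructuring in XQuery, the sketch below applies a hypothetical schema change, renaming the credits element to creditPoints, by means of a recursive copy. The change itself, the function name and the sample document are invented for illustration; everything other than the renamed element is copied unchanged.

```xquery
(: Hypothetical schema change: the element "credits" is renamed "creditPoints". :)
declare function local:migrate($n as node()) as node() {
  typeswitch ($n)
    case element(credits) return
      <creditPoints>{ $n/@*, $n/node() }</creditPoints>
    case element() return
      (: Rebuild any other element under its own name, migrating its children. :)
      element { node-name($n) } {
        $n/@*,
        for $c in $n/node() return local:migrate($c)
      }
    default return $n
};

(: Sample document invented for illustration. :)
let $old :=
  <ModuleSpecification>
    <ModuleCode>M1</ModuleCode>
    <title>Databases</title>
    <credits>30</credits>
  </ModuleSpecification>
return local:migrate($old)
```

In a live system the same function would be applied to each stored document in turn and the result written back, using whatever update mechanism the database provides.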
The models in this paper were developed in QSEE, a multi-case tool developed by Mark Dixon at Leeds Metropolitan University. A UML class model is perhaps the most appropriate modelling tool but is not entirely satisfactory. Ordering of relationships within the same class cannot be expressed, although this could be inferred from the left to right ordering of attachment points. Primary key fields cannot be indicated.
Model-driven development in this technology requires a tool which will generate multiple schemas from the conceptual model. Whilst there are many graphical tools for schema design, none handle multiple schemas from the one conceptual model. With the addition of stereotypes to guide the generation, this looks possible and is certainly desirable.
Native XML databases with XQuery and XSLT show great promise in providing a rapid and flexible application development environment for certain classes of problems. However, whilst development with Relational Databases is a mature discipline, the design space for NXDs has yet to be fully understood. Indeed the wider range of structures and tools increases the decision-making burden for the designer. With luck, this workshop will help to map out that space a little.
As already confessed, the choice of XML and NXD just seemed to be a good fit to this information system's requirements. However the broader question is whether there is a set of factors which are indicative or counter-indicative of the use of this technology. Preliminary indications are that for small, highly structured information systems requiring a web interface, NXD and eXist in particular offer a number of significant benefits over other approaches.
[1] R. Bourret (2006) XML and Databases. http://www.rpbourret.com/xml/XMLDatabaseProds.htm [accessed 13 Feb 2006]
[2] W. Meier (2003) eXist: an Open Source Native XML Database. In Revised Papers for the NODe 2002 Web and Database-Related Workshops on Web, Web-Services and Database Systems, pp 169-183. London. Springer-Verlag
[3] C. Nentwich, L. Capra, W. Emmerich and A. Finkelstein (2002) xlinkit: a Consistency Checking and Smart Link Generation Service. ACM Transactions on Internet Technology, 2(2), pp 151-185
[4] D. Schon (1983) The reflective practitioner: How professionals think in action. New York. Basic Books
eXist: exist-db.org/
FOLD - Faculty OnLine Data - www.cems.uwe.ac.uk/~cjwallac/FOLD
QSEE MultiCase tool: http://www.qsee-technologies.com/