Project description by Fritz Henglein
Background and goal
Current storage technologies for XML documents include flat files, relational databases, object-based databases, and native XML databases.
XMLStore is the central component in the Plan-X Project. XMLStore is an application configurable, distributed, mobile persistence layer for storing semistructured data (in particular XML documents) based on a value-oriented document model and interface. Together with XPath as query language operating on its data, it can be considered a native XML database system.
Previous work [1] has empirically analyzed the performance of Software AG's Tamino native XML database and Oracle's "XML-enabled" object-relational database system Oracle 9i with respect to storing XML documents and querying them using XPath and has found that those systems perform surprisingly badly and unpredictably, especially when operating on general XML documents without XML Schema. That work did not include a comparison with a simple-minded implementation as baseline case or with XMLStore.
The goals of this project are
(1) to implement and evaluate a simple implementation of XML query processing based on standard components as a baseline for comparisons;
(2) to evaluate the present configurable XMLStore implementation [2] and XPATH implementation [3] on the same benchmarks as [1] in a number of a configurations and conduct a comparison with the results previously obtained for Tamino and Oracle9i;
(3) to design and implement important performance improvements to XML Store as well as discuss additional performance improvements, including possible trade-offs w.r.t. maintainability of XMLStore;
(4) to analyze the results of (1) and (2)+(3), compare them with the Tamino and Oracle 9i results of [1], interpret them to see if they are consistent with the results of [4], and provide if possible, additional insights and conclusions on state-of-the-art XML storage technologies.
(5) (optionally) to investigate the use of indexes in XMLStore to aid performance
Comments
ad (1) Two reference implementations are recommended: (a) a 'cold start' program that is called with an XML file name and an XPATH query as its arguments and operates as follows: it load the XML file from disk, parses it into an in-memory DOM-representation, executes a standard XPATH engine for DOM on it, formats the result for output, outputs it and shuts down; (b) a server version of the above, where the program receives a stream of pairs (XML file name, XPath query) and maintains a cache of DOM-representations in its memory for rapid reuse.
ad (2) XMLStore can be configured as a 'raw disk module' at its back end, and with a number of (performance) 'decorators' (in OO design pattern lingo) in front of it: caching, asynchronous writing, etc. The test design and test data from [1] should be reused and supplemented as necessary.
ad (3) This depends on the results of (2). An important part of the project is to evaluate the performance of different existing and performance-improved configurations of XMLStore against each other and, if possible, conclude with recommendations for XMLStore configurations addressing certain application requirements on performance, durability and the like.
ad (5) This is entirely optional (and likely not to be achievable within the given time and resource constraints). The main idea is to store XML documents in XMLStore that function as indexes for other XML documents; e.g., a B-tree indexing all nodes in a given XML document by their tag. The design of such indexes -- which to build and which not -- should be guided by a designated set of (XPATH) queries that are to be performance-optimized. The use of indexes should be performance evaluated by hand and in a 'static' setting: a document is stored in XMLStore, a set of XPATH queries is identified for querying (but not updating) it, one or more indexes are constructed and also stored in XMLStore. The XPath queries on the original document are transformed (by hand) to programs operating on the original document plus its indexes. The purpose of this is to evaluate the impact on performance achievable by good indexing. For the purposes of this project it should be noted that the following substantial issues are _not_ important: incremental maintenance of the indexes wrt. changes in the original document; automatic design of indexes from presentation of query set.
Method (work stages, in temporal order)
- Get familiar with XMLStore and value-oriented programming in general by reading [6]. Research (find, read and analyze) literature on techniques for storing and retrieving XML documents by reading [5] on data models and databases for semistructured data and [4], which reports results from an existing survey of XML storage technologies. [4] is an important benchmark (baseline for comparison) for the present project.
- Get familiar with the XMLStore implementation [2] and try it out in a number of different configurations, ranging from simple raw disk implementation to a complex chain of decorators in front of a raw disk implementation. Get familiar with Thomas Ambus's XPATH engine for XMLStore [3] and try it out.
- Design and implement simple DOM-based query engine(s); see goal (1).
- Reuse and reconsider test design and data from [1]. Add additional test data as necessary, but do so on the basis of well-founded test design considerations ('what can be found out by the new test data that couldn't be by the old ones?'). In particular, recall the test design in [1]:
- Classification of access patterns of XML clients (programs that store and retrieve XML documents); that is, in what order they produce the data in XML documents for storage, and in what order they use data in XML documents while running ( XML documents) and research characteristics of.
- Design and implementation of test data so that each class of clients categorized above is represented by at least one benchmark example that can be used for performance evaluation (and their implementations serve as illustrations of programmability).
- The classification process should be driven by systematic and empirical considerations: (a) Systematically: left-to-right exhaustive access to all nodes in all XML documents (exhaustive document order access); right-to-left access to all nodes; breadth-first search access to all nodes; key search guided access to small fragment of nodes; etc. (b) Empirically: Find existing applications and see how they access data; think about simple, realistic applications yourself. Specifically: think about XSLT applications and how XPATH and XQUERY queries access XML data.
- Conduct performance test: measure time and space consumption in a number of input scenarios. Analyze (look at, explain and compare) the performance results. Identify performance bottlenecks, especially for XMLStore.
- Design and implement central performance improvements based on identified performance bottlenecks above. Evaluate and document difference to unoptimized XMLStore.
- Work out work plan for goals (4), (5), and for report writing, on the basis of results achieved at this point and time that is left in the project.
Literature
[1] Morten Guld, Eske Bentzen, "Survey of XML Storage Technologies", Bachelor's project, DIKU, May 2003
[2] http://plan-x.org/xmlstore/
[3] Thomas Ambus, "XPATH Engine for XMLStore", Master's student project, DIKU, May 2003 (see http://www.ambus.dk/planx/xpath/; will be made available on plan-x.org)
[4] Tian, DeWitt, Chen, Zhang, "Design and Performance Evaluation of
Alternative XML Storage Strategies", SIGMOD Record, Vol. 31, No. 1,
March 2002, pp. 5-10
PDF:
http://www.acm.org/sigmod/record/issues/0203/SPECIAL/1.tian.pdf.gz
PS: http://www.acm.org/sigmod/record/issues/0203/SPECIAL/1.tian.ps.gz
See also their 26 page technical report.
[5] "Data on the Web: From Relations to Semi structured Data and XML", by Serge Abiteboul, Peter Buneman, Dan Suciu; Morgan Kaufman, 1999 (out of print); should be available through DIKU's library
[6] Kasper Bøgebjerg Pedersen and Jesper Tejlgaard Pedersen, "Value-oriented XML Store", Master's thesis, ITU and DTU, 2002. http://www.it-c.dk/~kasperp/xmlstore/pdf/thesis.pdf
Additional literature for indexing (if required):
Read up on indexing for relational databases in a standard text book on database systems.
Look for literature on indexing for XML or semistructured data, particularly with regard to supporting XPATH and/or similar query languages (such as XQUERY).
NB: Use the above as "seed" literature and leads. Look for more information using electronic research tools such as research index.com, google.com electronic libraries (accessible from DIKU through DIKU's web site).
helmudt)