Project Description
Title
Performance Evaluation and Enhancement of XML store for Database Applications
Background and Goal
XML has received increasing attention in recent years, as it has
become a standard way of exchanging data. One area where XML can be
used is business-to-business communication. This can be achieved
through service-oriented architectures that externalise public
business services. These services can be activated remotely by other
companies, using XML as the data carrier for requests and responses.
To make the services of one business publicly known, they can be
registered in a central service repository for others to find and
subscribe to. The processes describing inter-business communication
can likewise be stored in and retrieved from XML repositories.
The scenario described above is just one of the reasons why we believe that XML will gain even more focus in the future. Reliable and fast facilities for storing large amounts of semistructured XML data are of paramount importance if an XML Registry and Repository is to become a reality.
Relational databases have existed for decades and have been fine-tuned for performance and reliability. As [1] and [4] show, this cannot be said of the existing XML storage facilities. Most of them have performance problems and still suffer from teething troubles, which cause them to behave unpredictably when handling data. [4] shows how the chosen storage strategy has a compound effect on performance, and [1] shows that handling large amounts of data is not well supported in the tested commercial XML storage systems.
Fritz Henglein describes an alternative XML storage facility: XMLStore of the Plan-X Project. XMLStore is an application-configurable, distributed, mobile persistence layer for storing semistructured data (in particular XML documents), based on a value-oriented document model and interface. Together with XPath as a query language operating on its data, it can be considered a native XML database system.
The goal of this project is to find out whether XMLStore might provide a better solution for storing data encoded in XML syntax. We want to determine this by comparing the results of [1] with similar tests of XMLStore. During testing we might discover ways to improve the performance of the database functionality of XMLStore; these improvements will be discussed and, depending on their complexity, implemented.
Goals Described
The first goal of the project is to evaluate the performance of database-like queries performed on semistructured data stored using XML syntax. The evaluation is based on [1], which evaluated a native XML database and an XML-enabled RDBMS. This project will take the tests further by testing queries performed on flat text files storing XML, and on the value-oriented XMLStore from the Plan-X project. The second goal of the project is to suggest performance enhancements of the value-oriented XMLStore, based on the results of the first part of the project; both execution time and memory usage are taken into account.
The plan is not final; we expect it to change as we gain more knowledge of the project scope. The project parts and submodules are outlined and described below.
I. Performance evaluation
- All performance evaluations will include timing and system resource consumption during execution of the tested functionality.
- We want to be able to compare our results with the results from [1], so we have to perform at least the tests that are described there.
- We will not limit ourselves to the tests from [1] if these are not detailed enough to show where possible optimizations could be implemented. [1] mentions that it was rather cumbersome to set up their tested stores; so instead of duplicating their setup, we will set up a baseline test environment for querying XML stored in flat text files using standard components.
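The timing and resource measurements mentioned above could be collected with a small harness along the following lines. This is only a sketch under stated assumptions: the function name `measure` and the `repeats` parameter are hypothetical, and Python's standard `time` and `tracemalloc` modules stand in for whatever tooling the project ends up using. Note that `tracemalloc` only sees memory allocated by the Python interpreter and slows execution down, so timings taken this way are comparable with each other but not with external measurements.

```python
import time
import tracemalloc


def measure(func, *args, repeats=5):
    """Run func(*args) `repeats` times (hypothetical helper).

    Returns the last result, the best wall-clock time in seconds and the
    highest peak of Python-level memory allocation seen in any run.
    """
    best_seconds = float("inf")
    peak_bytes = 0
    result = None
    for _ in range(repeats):
        tracemalloc.start()
        start = time.perf_counter()
        result = func(*args)
        elapsed = time.perf_counter() - start
        _, run_peak = tracemalloc.get_traced_memory()
        tracemalloc.stop()
        best_seconds = min(best_seconds, elapsed)
        peak_bytes = max(peak_bytes, run_peak)
    return {"result": result, "seconds": best_seconds, "peak_bytes": peak_bytes}


if __name__ == "__main__":
    # Stand-in workload; a real test would pass a query function instead.
    stats = measure(sorted, list(range(100_000, 0, -1)))
    print(f"{stats['seconds']:.6f} s, {stats['peak_bytes']} bytes peak")
```

Reporting the best of several runs reduces noise from other processes; a cold-start test would instead use a single run in a fresh process.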
(i) Standard XML querying components
- Testing plain XML files may seem simplistic, but we believe the file approach is used in many applications precisely because of its simplicity. We will use this test as a baseline for the other tests, both those in this report and those described in [1].
- We will perform this test in two ways: a 'Cold Started' program and a server.
- The Cold Start test will be performed "out of the box". That way we will not be testing some of the extra functionality found in some of the other products being tested (here and in [1]), and we are sure to keep the simplicity of the file approach intact. "Out of the box" means that we use open source code modules to parse an XML file into memory as a DOM structure, query this structure using XPath, return the result and shut down.
- The server test works as above, except that it takes a map of queries and files to query, and when it loads a file into memory it caches it there for later use by another query.
- When comparing test results we have to be sure what we are comparing. In our case there is both an XML data representation and a query engine performing the queries on the data. When comparing the data representations of the stores, we need to be sure that the tests are not influenced by the query engine performing the queries. We might therefore need to use the same query engine to test different data implementations, if that is at all possible.
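The difference between the Cold Start test and the server test can be illustrated with the following sketch. It rests on several assumptions: Python's standard `xml.etree.ElementTree` is used as a stand-in for the open source DOM and XPath modules, and it supports only a subset of XPath (relative paths such as `.//item/name`); the names `cold_start_query`, `QueryServer`, `write_sample_file` and the shape of the query map (file path to list of queries) are hypothetical.

```python
import os
import tempfile
import xml.etree.ElementTree as ET


def write_sample_file():
    """Hypothetical helper: write a tiny XML document to a temp file."""
    fd, path = tempfile.mkstemp(suffix=".xml")
    with os.fdopen(fd, "w", encoding="utf-8") as handle:
        handle.write("<catalog><item><name>a</name></item>"
                     "<item><name>b</name></item></catalog>")
    return path


def cold_start_query(path, xpath):
    """Parse the file, run one query and return the matches.

    Nothing is kept between calls, so every query pays the full parse cost.
    """
    tree = ET.parse(path)
    return tree.getroot().findall(xpath)


class QueryServer:
    """Keeps parsed documents in memory so later queries skip parsing."""

    def __init__(self):
        self._cache = {}       # file path -> parsed root element
        self.parse_count = 0   # number of files actually parsed

    def _load(self, path):
        if path not in self._cache:
            self._cache[path] = ET.parse(path).getroot()
            self.parse_count += 1
        return self._cache[path]

    def run(self, jobs):
        """jobs maps a file path to a list of XPath queries (assumed shape)."""
        results = {}
        for path, xpaths in jobs.items():
            root = self._load(path)
            for xpath in xpaths:
                results[(path, xpath)] = root.findall(xpath)
        return results


if __name__ == "__main__":
    sample = write_sample_file()
    try:
        print([e.text for e in cold_start_query(sample, ".//item/name")])
        server = QueryServer()
        server.run({sample: [".//item/name", ".//item"]})
        print("files parsed by server:", server.parse_count)
    finally:
        os.remove(sample)
```

Because both variants call the same query function on the same in-memory structure, any timing difference between them comes from parsing and caching rather than from the query engine, which is the separation the last point above asks for.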
(ii) XMLStore
- Since the current XMLStore implementation can be dynamically configured in many different ways, each configuration that makes sense should be tested on the same benchmarks as [1] (this also requires the use of the XPath implementation of [3]).
- The test results obtained will be analyzed and compared with the results of (i) and [1]. We will interpret the results to see if they are consistent with the results of [4], and then provide, if possible, additional insights and conclusions on state-of-the-art XML storage technologies.
II. Enhancements of the XMLStore
- The enhancements described in this part of the project depend heavily on the results found in (I), which makes it hard to outline very specific parts for this section, although we have some suggestions that might improve performance.
- Since the enhancements are based on the test results of section (I), they will focus on the performance of the XMLStore. If possible within the time and resource constraints of the project, we will design and implement the performance improvements in the XMLStore. Those improvements we do not implement will be described in detail, including possible tradeoffs, so that later implementation will be possible.
(i) Possible performance improvements
- Locating possible bottlenecks in the data flow from XML file to XMLStore. We will look at parsing and intermediate storage.
- An XML Schema for the submitted data might make it possible to optimise the query strategy of XPath queries.
(ii) Indexes (Optional)
- Indexes over specific tags in the XML tree can improve performance in query-intensive applications (update-intensive applications might suffer in performance, since the indexes have to be maintained). For example, an index over <tagname> could make a query like //tagname/subtagname perform better (especially in a wide and deep tree, where the <tagname> tag is deeply positioned).
- The indexes could either be user defined, or automatically generated over the statistically most queried data. These indexes could of course be stored in the XMLStore itself.
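The tag index idea can be sketched as follows. This is an illustration only, not the XMLStore design: it assumes an in-memory tree built with Python's standard `xml.etree.ElementTree`, and the function names `build_tag_index`, `query_with_index` and `query_by_traversal` are hypothetical. The index maps each tag name to the elements carrying it, so a query like //tagname/subtagname can jump directly to the <tagname> elements instead of walking the whole tree.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict


def build_tag_index(root):
    """Map each tag name to the elements carrying it, in document order."""
    index = defaultdict(list)
    for elem in root.iter():  # visits root and all descendants once
        index[elem.tag].append(elem)
    return index


def query_with_index(index, tagname, subtagname):
    """Answer //tagname/subtagname using the index (no tree traversal)."""
    return [child
            for parent in index.get(tagname, [])
            for child in parent
            if child.tag == subtagname]


def query_by_traversal(root, tagname, subtagname):
    """Baseline: walk the whole tree looking for <tagname> elements."""
    return [child
            for parent in root.iter(tagname)
            for child in parent
            if child.tag == subtagname]


if __name__ == "__main__":
    doc = ET.fromstring(
        "<r><a><b><tagname><subtagname>deep</subtagname></tagname></b></a>"
        "<tagname><subtagname>shallow</subtagname><other/></tagname></r>")
    index = build_tag_index(doc)
    print([e.text for e in query_with_index(index, "tagname", "subtagname")])
```

Building the index costs one full traversal, and every insertion or deletion of an indexed element would have to update it, which is the tradeoff for update-intensive applications noted above.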
Literature
[1] Morten Guld, Eske Bentzen, "Survey of XML Storage Technologies", Bachelor's project, DIKU, May 2003
[2] http://plan-x.org/xmlstore/
[3] Thomas Ambus, "XPATH Engine for XMLStore", Master's student project, DIKU, May 2003 (see http://www.ambus.dk/planx/xpath/; will be made available on plan-x.org)
[4] Tian, DeWitt, Chen, Zhang, "Design and Performance Evaluation of
Alternative XML Storage Strategies", SIGMOD Record, Vol. 31, No. 1,
March 2002, pp. 5-10
PDF:
http://www.acm.org/sigmod/record/issues/0203/SPECIAL/1.tian.pdf.gz
PS:
http://www.acm.org/sigmod/record/issues/0203/SPECIAL/1.tian.ps.gz
See also their 26 page technical report.
[5] Serge Abiteboul, Peter Buneman, Dan Suciu, "Data on the Web: From Relations to Semistructured Data and XML", Morgan Kaufmann, 1999 (out of print); should be available through DIKU's library
[6] Kasper Bøgebjerg Pedersen and Jesper Tejlgaard Pedersen, "Value-oriented XML Store", Master's thesis, ITU and DTU, 2002. http://www.it-c.dk/~kasperp/xmlstore/pdf/thesis.pdf