Tuesday, February 22, 2005

Durus + ElementTree = XML Object Database or Frankenstein's Monster?

I had one of those wacky ideas that I can't think of an immediate application for, but that is easy enough to try out.

I just discovered Durus, which is a networkable, transactional object store for Python. It seems cool, despite the admitted limitations on scalability, but it could easily create Python lock-in if you're not careful. What if you're concerned about moving those objects to another system in another language down the road? Or what if you already have a large body of data that you want to access? Sounds like a reasonable use of XML to me. If nothing else, XML's more limited structure would prevent one from doing anything too Python-specific.

9 lines of magic: Grabbing my favorite Pythonic XML-tool, ElementTree, I was able to combine it with Durus in only 9 lines of Python:

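from durus.persistent import Persistent
from durus.persistent_list import PersistentList
from elementtree import ElementTree

# Mix Persistent into the tree class, then rebind the module-level name
# so that code looking it up at call time builds the persistent version.
class PElementTree(ElementTree.ElementTree, Persistent):
    pass
ElementTree.ElementTree = PElementTree

# Same trick for the element class.
class PElementInterface(ElementTree._ElementInterface, Persistent):
    pass
ElementTree._ElementInterface = PElementInterface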

(It would be great to support cElementTree as well, but getting its extension types to pickle, a requirement for use in Durus, is non-trivial, at least relative to the above, which is the very definition of trivial. I might look into cElementTree mods in a future post.)

Then you can just add ElementTrees to the Durus root object:

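from durus.client_storage import ClientStorage
from durus.connection import Connection
tree = ElementTree.parse(xml_source)  # the patched module; xml_source is a filename or file object
connection = Connection(ClientStorage())
connection.get_root()["xml"] = tree
connection.commit()  # nothing is stored until the transaction commits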

Reading and modifying the XML data is as easy as the ElementTree API.
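To give a feel for what that looks like, here is a minimal sketch using the standard library's xml.etree.ElementTree (the descendant of the elementtree package) on a tiny made-up document. The SAMPLE text is an assumption for illustration; it only borrows the bookcoll/book/bktshort/chapter/v element names used in the benchmarks further down. With the Durus setup, the tree would come from connection.get_root()["xml"] rather than from a string.

```python
import xml.etree.ElementTree as ET

# Hypothetical miniature document, shaped like the Old Testament file.
SAMPLE = (
    "<tstmt><bookcoll><book>"
    "<bktshort>Genesis</bktshort>"
    "<chapter><v>Adam begat Seth.</v><v>Seth lived.</v></chapter>"
    "</book></bookcoll></tstmt>"
)

tree = ET.ElementTree(ET.fromstring(SAMPLE))

# Read: collect the short book titles.
titles = [t.text for t in tree.findall("./bookcoll/book/bktshort")]

# Modify: upper-case every verse in place.
for v in tree.findall("./bookcoll/book/chapter/v"):
    v.text = v.text.upper()

verses = [v.text for v in tree.findall("./bookcoll/book/chapter/v")]
print(titles, verses)
```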

What's the point of such a beast? You can have multiple clients accessing and modifying the same XML document (including basic transaction support). Without Durus (or similar object database approach), one would have to lock the XML files and then parse and dump the XML with each transaction. Backing up or transferring the data to nice clean XML is trivial -- you don't have to write an object -> XML mapping because the object is the XML. Of course, you lose some of the point of using a persistent object database by limiting yourself in this way. Lastly, this beast will have the same limitations on scalability as Durus, but that may not be a problem in all applications.
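For comparison, the lock, parse, and dump cycle you would need without an object database might look roughly like the sketch below. This is only one possibility: the exclusive-create lock file, the locked_update name, the timeout parameter, and the demo file are all assumptions made for illustration, and it is written against Python 3's standard library.

```python
import os
import tempfile
import time
import xml.etree.ElementTree as ET


def locked_update(path, mutate, timeout=5.0):
    """Parse `path`, apply mutate(tree), and write it back under a lock file."""
    lock_path = path + ".lock"
    deadline = time.time() + timeout
    while True:
        try:
            # O_EXCL makes creation fail if someone else already holds the lock.
            fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
            break
        except FileExistsError:
            if time.time() > deadline:
                raise TimeoutError("could not lock %s" % path)
            time.sleep(0.05)
    try:
        tree = ET.parse(path)        # full parse on every transaction
        mutate(tree)
        tmp_path = path + ".tmp"
        tree.write(tmp_path)         # full dump on every transaction
        os.replace(tmp_path, path)   # swap the new file into place
    finally:
        os.close(fd)
        os.remove(lock_path)


# Hypothetical demo document.
demo_path = os.path.join(tempfile.mkdtemp(), "demo.xml")
with open(demo_path, "w") as f:
    f.write("<book><bktshort>Genesis</bktshort></book>")


def shout_title(tree):
    title = tree.find("./bktshort")
    title.text = title.text.upper()


locked_update(demo_path, shout_title)
result = ET.parse(demo_path).find("./bktshort").text
```

Every writer pays for a complete parse and a complete dump, however small the change, which is the cost the Durus approach tries to avoid.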

How does this thing perform? I devised some silly benchmarks (all of which use Jon Bosak's Old Testament in XML) trying to cover a range of use cases. If you know of other good tests, please suggest them to me and I'll try to include them here.

  • begat: Uche Ogbuji's "Old Testament" test: Find all verses in the Old Testament containing the word "begat".

  • book_title: Find all the book titles in the Old Testament.

  • book_title_remove: Remove all of the title elements.

  • book_title_remove_text: Remove the text from all of the title elements, but leave the elements themselves intact.

  • upper_case: Convert all of the verses to upper case.

The code for these benchmarks is:

def begat_benchmark(tree):
    # Scan every verse and print the ones mentioning "begat".
    for v in tree.findall("./bookcoll/book/chapter/v"):
        if v.text.find("begat") >= 0:
            print(v.text)
begat_benchmark.mutates = False

def book_title_benchmark(tree):
    # Print the short title of every book.
    for title in tree.findall("./bookcoll/book/bktshort"):
        print(title.text)
book_title_benchmark.mutates = False

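def book_title_remove_benchmark(tree):
    for book in tree.findall("./bookcoll/book"):
        # The loop body is missing from the listing; removing the title
        # element is what the benchmark description calls for.
        book.remove(book.find("./bktshort"))
book_title_remove_benchmark.mutates = True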

def book_title_remove_text_benchmark(tree):
    for book in tree.findall("./bookcoll/book"):
        book.find("./bktshort").text = ""
book_title_remove_text_benchmark.mutates = True

def upper_case_benchmark(tree):
    for v in tree.findall("./bookcoll/book/chapter/v"):
        v.text = v.text.upper()
upper_case_benchmark.mutates = True

Each of these benchmarks was run in four different environments:

  • ElementTree: Uses Python ElementTree to save/load XML to disk. When the benchmark involves changing the XML, the XML file is locked for the entire operation.

  • cElementTree: Same as above, but uses C ElementTree

  • Durus/ElementTree: Stores the ElementTree with a Durus server. The time for connecting to the database and committing to the database is included in the runtime.

  • Durus/Document-level ElementTree: Stores the entire document as a single item in the Durus server. (The Elements themselves are not "persisted" as separate objects.) The results show that there are some interesting tradeoffs between this approach and the one above.

All times were measured using wall-clock time so that the time spent in the Durus server would be included. The tests were run on a dual Xeon system with 1GB of RAM. Obviously running over a network would add some overhead, and I didn't test that.
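A harness along these lines would produce the columns in the tables below. It is a sketch rather than the code actually used: time_benchmark, the use of time.perf_counter, and the in-memory document are assumptions, and it covers only the plain ElementTree case, using the standard library's xml.etree.ElementTree under Python 3. The mutates flag is the same attribute set on the benchmark functions above.

```python
import io
import time
import xml.etree.ElementTree as ET


def time_benchmark(benchmark, xml_bytes):
    """Return wall-clock seconds for the parse, benchmark, and write phases."""
    timings = {}

    start = time.perf_counter()
    tree = ET.parse(io.BytesIO(xml_bytes))
    timings["parse"] = time.perf_counter() - start

    start = time.perf_counter()
    benchmark(tree)
    timings["benchmark"] = time.perf_counter() - start

    # Only benchmarks flagged as mutating pay for writing the document back.
    if getattr(benchmark, "mutates", False):
        start = time.perf_counter()
        tree.write(io.BytesIO())
        timings["write"] = time.perf_counter() - start

    timings["total"] = sum(timings.values())
    return timings


def upper_case(tree):
    for v in tree.findall("./chapter/v"):
        v.text = v.text.upper()
upper_case.mutates = True

# Hypothetical one-verse document, just to exercise the harness.
timings = time_benchmark(
    upper_case, b"<book><chapter><v>in the beginning</v></chapter></book>"
)
print(timings)
```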

The results:

begat:

              parse/connect  benchmark  total
ET            3.3172s        0.2096s    3.5268s
cET           0.1031s        0.1720s    0.2751s
Durus-ET      0.0022s        11.4826s   11.4848s
Durus-Doc-ET  0.0025s        0.8641s    0.8666s

book_title:

              parse/connect  benchmark  total
ET            3.3593s        0.0049s    3.3643s
cET           0.0951s        0.0024s    0.0974s
Durus-ET      0.0022s        1.8796s    1.8818s
Durus-Doc-ET  0.0026s        0.6883s    0.6909s

book_title_remove:

              parse/connect  benchmark  write/commit  total
ET            3.5531s        0.0015s    2.5252s       6.0799s
cET           0.0957s        0.0010s    2.3441s       2.4408s
Durus-ET      0.0022s        0.1090s    0.0004s       0.1115s
Durus-Doc-ET  0.0022s        0.7283s    0.0006s       0.7311s

book_title_remove_text:

              parse/connect  benchmark  write/commit  total
ET            3.5531s        0.0015s    2.5252s       6.0799s
cET           0.0957s        0.0010s    2.3441s       2.4408s
Durus-ET      0.0022s        0.1090s    0.0004s       0.1115s
Durus-Doc-ET  0.0022s        0.7283s    0.0006s       0.7311s

upper_case:

              parse/connect  benchmark  write/commit  total
ET            3.2904s        0.2834s    2.5160s       6.0898s
cET           0.0946s        0.1691s    2.3705s       2.6342s
Durus-ET      0.0022s        12.3476s   7.2875s       19.6373s
Durus-Doc-ET  0.0025s        0.9293s    0.0007s       0.9325s

It's interesting to note that, in aggregate time, Durus at the document level outperforms Python ElementTree in every case. Note, however, that the overall size of the document matters here, since it affects the parsing and writing times.
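That comparison is easy to check from the totals. The snippet below copies the ET and Durus-Doc-ET totals from the five result tables, in the order the tables appear; the variable names are just illustrative.

```python
# Totals in seconds, copied from the result tables, one entry per table in order.
et_totals = [3.5268, 3.3643, 6.0799, 6.0799, 6.0898]
durus_doc_totals = [0.8666, 0.6909, 0.7311, 0.7311, 0.9325]

# How many times faster the document-level Durus run was than plain ElementTree.
speedups = [et / doc for et, doc in zip(et_totals, durus_doc_totals)]
doc_level_always_wins = all(s > 1 for s in speedups)

for s in speedups:
    print("Durus-Doc-ET vs ET: %.1fx" % s)
```

The ratios come out at roughly 4x to 8x, with nearly all of the difference coming from the parse and write columns rather than the benchmark itself.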

When seeks and edits are relatively few, storing the XML in Durus at the element level can be a real win. For grand sweeping changes over the whole file (the upper_case benchmark), this fine-grained approach gets really bogged down by the serialize/deserialize overhead.

Useful? If you need concurrent access and transactions, this could be an interesting way to work on XML files. Please leave comments below.