Watering the trees of markup

4DOM, minidom, pDomlette, cDomlette, Amara bindery, and on, and on. I've been involved in the development of a good many XML tree APIs, and for much of that work I've collaborated with Jeremy Kloth. Jeremy and I are back at it again. We have a lot of lessons learned under our belts about tree APIs, and we've spent a good long while working through them: thinking through performance, thinking through APIs. Amara 2.0 now offers the fruit of all that. It comes with several tree APIs, and the machinery to readily make more, if you like.

The root of it all is the new, built-in module, amara.tree. It shares some lineage with Domlette, being implemented in C for performance, but the API has been completely reworked. First of all, it's compatible with data binding APIs because almost all its methods and properties are prefixed with "xml_". It's also designed to add sanity to the well-known minefield of namespace behavior. Here's a little tooling around with the new API:

import amara
from amara import tree

MONTY_XML = """<monty>
  <python spam="eggs">What do you mean "bleh"</python>
  <python ministry="abuse">But I was looking for argument</python>
</monty>"""

doc = amara.parse(MONTY_XML)
#Node types use string rather than numerical constants now
#The root node type is called entity
assert doc.xml_type == tree.entity.xml_type
m = doc.xml_children[0] #xml_children is a sequence of child nodes
assert m.xml_local == u'monty' #local name, i.e. without any prefix
assert m.xml_qname == u'monty' #qualified name, i.e. including the prefix, if any
assert m.xml_prefix == None
assert m.xml_namespace == None
assert m.xml_name == (None, u'monty') #The "universal name" or "expanded name"
assert m.xml_parent == doc

p1 = m.xml_children[0]

from amara import xml_print
#<python spam="eggs">What do you mean "bleh"</python>
xml_print(p1)
print
print p1.xml_attributes[(None, u'spam')]

#Some manipulation
p1.xml_attributes[(None, u'spam')] = u"greeneggs"
p1.xml_children[0].xml_value = u"Close to the edit"
xml_print(p1)
print

Note: the root node type is called entity because it can be an XML document entity (i.e. one with exactly one root element).
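
The listing above has no namespaces in it, so every prefix and namespace comes out as None. To get a feel for how local name, qualified name, and universal name differ once a prefix shows up, here is a minimal sketch that uses the standard library's minidom as a stand-in, not Amara itself. The namespace URI and the universal_name helper are made up for illustration; the pair it returns is only meant to mirror the (namespace, local name) shape of xml_name above.

```python
from xml.dom import minidom

# Made-up namespace URI, purely for illustration
NS_XML = '<m:monty xmlns:m="http://example.com/monty"><m:python spam="eggs"/></m:monty>'

def universal_name(node):
    """Hypothetical helper: the (namespace, local name) pair for a minidom node."""
    return (node.namespaceURI, node.localName)

ns_doc = minidom.parseString(NS_XML)
root = ns_doc.documentElement

print(root.tagName)          # qualified name, prefix included
print(root.localName)        # local name, prefix stripped
print(root.prefix)           # just the prefix
print(universal_name(root))  # namespace plus local name

# An unprefixed attribute is in no namespace, even on a prefixed element
child = root.firstChild
print(child.getAttributeNS(None, 'spam'))
```

The prefix is just a local abbreviation; two documents can use different prefixes for the same namespace, which is why the universal name is the safer thing to compare.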

Some of that xml_children[N] stuff is a bit awkward, and people really liked the Amara 1.x bindery. Well, I like it too, so we've brought it back.

from amara import bindery
from amara import xml_print

MONTY_XML = """<monty>
  <python spam="eggs">What do you mean "bleh"</python>
  <python ministry="abuse">But I was looking for argument</python>
</monty>"""

doc = bindery.parse(MONTY_XML)
m = doc.monty
p1 = doc.monty.python #or m.python; p1 is just the first python element
print
print p1.xml_attributes[(None, u'spam')]
print p1.spam

for p in doc.monty.python: #The loop will pick up both python elements
    xml_print(p)
    print
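
If you're wondering how doc.monty.python can be both "the first python element" and something you can loop over, here is a toy sketch of the idea. To be clear, this is not how bindery is implemented; it is a hypothetical Bound class layered over the standard library's ElementTree, and the xml_elem and xml_siblings names are invented for the example. It does hint at why the real API keeps its own names behind an "xml_" prefix: plain attribute names are reserved for whatever is in your document.

```python
import xml.etree.ElementTree as ET

MONTY_XML = """<monty>
  <python spam="eggs">What do you mean "bleh"</python>
  <python ministry="abuse">But I was looking for argument</python>
</monty>"""

class Bound(object):
    """Toy data binding wrapper (hypothetical, not Amara's bindery)."""
    def __init__(self, elem, siblings=None):
        self.xml_elem = elem
        # Same-named siblings, so that iteration can yield all of them
        self.xml_siblings = siblings if siblings is not None else [elem]

    def __getattr__(self, name):
        # Only reached when normal attribute lookup fails
        if name.startswith('xml_'):
            raise AttributeError(name)
        matches = self.xml_elem.findall(name)
        if matches:
            # Child elements win: bind to the first, remember the rest
            return Bound(matches[0], matches)
        if name in self.xml_elem.attrib:
            return self.xml_elem.attrib[name]
        raise AttributeError(name)

    def __iter__(self):
        return (Bound(e) for e in self.xml_siblings)

m = Bound(ET.fromstring(MONTY_XML))  # plays the role of doc.monty
p1 = m.python                        # first python element
print(p1.spam)
for p in m.python:                   # picks up both python elements
    print(p.xml_elem.text)
```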

Importantly, bindery nodes are subclasses of amara.tree nodes, so everything in the first listing applies fine to the nodes in the second. Just for completeness, I'll mention that we've thrown in some DOM support as, I guess, a concession to legacy. Honestly, as Guido said, "DOM sucks", and I really don't recommend it, but it's there:

from amara import dom
from amara import xml_print

MONTY_XML = """<monty>
  <python spam="eggs">What do you mean "bleh"</python>
  <python ministry="abuse">But I was looking for argument</python>
</monty>"""

doc = dom.parse(MONTY_XML)
for p in doc.getElementsByTagNameNS(None, u"python"): #A generator
    xml_print(p)
    print

p1 = doc.getElementsByTagNameNS(None, u"python").next()
print p1.getAttributeNS(None, u'spam')
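
One thing to watch if you're porting DOM code: in the listing above getElementsByTagNameNS hands back a generator, hence the .next() call, whereas the standard library's minidom returns a list-like NodeList. Here is a small sketch, run against minidom rather than Amara, of a helper that doesn't care which one it gets. The name first is my own invention, and the next() built-in it relies on assumes Python 2.6 or later.

```python
from xml.dom import minidom

MONTY_XML = """<monty>
  <python spam="eggs">What do you mean "bleh"</python>
  <python ministry="abuse">But I was looking for argument</python>
</monty>"""

def first(nodes):
    """Hypothetical helper: first item of a list, NodeList, or generator."""
    return next(iter(nodes))

std_doc = minidom.parseString(MONTY_XML)
# In minidom this is a NodeList, not a generator
pythons = std_doc.getElementsByTagNameNS(None, "python")
p1 = first(pythons)
spam = p1.getAttributeNS(None, "spam")
print(spam)
```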

One XMLism that's a lot more popular among developers is XPath. The amara core tree supports XPath (just use the new xml_select method on nodes), which means all the other implementations do, as well. I'll use bindery as an example. The main significance here is that bindery now has *full* XPath support, not just patched-up and partly broken support as in Amara 1.x.

from amara import bindery
from amara import xml_print

MONTY_XML = """<monty>
  <python spam="eggs">What do you mean "bleh"</python>
  <python ministry="abuse">But I was looking for argument</python>
</monty>"""

doc = bindery.parse(MONTY_XML)
m = doc.monty
p1 = doc.monty.python
print p1.xml_select(u'string(@spam)')

for p in doc.xml_select(u'//python'):
    xml_print(p)
    print
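
For a sense of what "full" buys you, compare with the path subset in the standard library's ElementTree. The sketch below is not Amara code, and the variable names are just for illustration. ElementTree's findall can approximate //python, but it only understands a limited path syntax and has no XPath functions such as string(), so the second query has to be spelled as ordinary method calls.

```python
import xml.etree.ElementTree as ET

MONTY_XML = """<monty>
  <python spam="eggs">What do you mean "bleh"</python>
  <python ministry="abuse">But I was looking for argument</python>
</monty>"""

root = ET.fromstring(MONTY_XML)  # the monty element

# Roughly //python: every python element below the root
all_pythons = root.findall('.//python')

# Simple attribute predicates are part of the supported subset
with_spam = root.findall('.//python[@spam]')

# No string(@spam) here; fetch the attribute by hand instead
spam_value = root.find('python').get('spam')
print(spam_value)
```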

And now for something completely different! As a bonus we've added an html5lib tree builder based on bindery. This means you can rely on html5lib to parse whatever ugly HTML you have to throw at it, and end up with all the benefits of bindery.

import html5lib
from amara.bindery import html
from amara import xml_print

f = open("eg.html")
parser = html5lib.HTMLParser(tree=html.treebuilder)
doc = parser.parse(f)
print unicode(doc.html.head.title)
xml_print(doc.html.head.title)
print
print doc.xml_select(u"string(/html/head/title)")

Web-scraping just got a whole lot more fun. And that's all for day 2. Sorry for the delay from day 1. The man keeps making me work for my paycheck instead of hanging out with you good people :)

Amara2/Seven_days/2 (last edited 2008-11-24 18:46:31 by localhost)