Amara 2.0 tutorial
by Uche Ogbuji
Amara is a Python toolkit for XML processing. There are many such toolkits in Python (which is not a bad thing), so a word is in order about what sets Amara apart. Amara is not about shredding the XML as quickly as possible to turn it into something the desperate Python hacker can stomach. It is rather a system for working with XML as naturally as possible without too much that is alien from Python conventions. If you need to work with XML in a broader context, perhaps in cooperation with non-Python tools, and if you want to tackle some of the more interesting bits of XML, such as mixed content and declarative transforms, you'll probably find Amara a very useful option.
In this tutorial I cover a bit of what Amara can do, focusing on things that give you a flavor of the Amara way. In particular I cover:
- Accessing and modifying the contents of XML and HTML documents (Amara supports parsing non-well-formed HTML)
- Generating XML and HTML output
- Defining simple XML models to help accelerate development
And for fun,
- Multiple dispatch from XML document patterns to Python methods (think XSLT in Python-native form)
Watering the trees of markup
Amara 2.0 comes with several tree APIs, and makes it fairly easy to design your own custom tree APIs, if you like.
Parsing XML into simple trees
The most fundamental tree API is just called amara.tree. It's very simple and highly optimized, but it lacks some of the features of the Bindery API, which you'll be using most of the time, unless you really need to wring out every ounce of performance.
The following example parses a small document into an amara.tree structure:
import amara
from amara import tree
MONTY_XML = """<monty>
<python spam="eggs">What do you mean "bleh"</python>
<python ministry="abuse">But I was looking for argument</python>
</monty>"""
doc = amara.parse(MONTY_XML)
doc is an amara.tree.entity node, the root of nodes representing the elements, attributes, text, etc. in the document.
assert doc.xml_type == tree.entity.xml_type
doc.xml_children is a sequence of the child nodes of the entity, including the top element.
m = doc.xml_children[0]
You might be wondering about the common "xml_" prefix for these methods. The higher-level Bindery (data binding) API builds on amara.tree. It constructs object attribute names from names in the XML document. In XML, names starting with "xml_" are reserved so this Amara convention helps avoid name clashes.
You can navigate from a node to its parent.
assert m.xml_parent == doc
Access all the components of the node's name, including namespace information.
assert m.xml_local == u'monty' #local name, i.e. without any prefix
assert m.xml_qname == u'monty' #qualified name, e.g. includes prefix
assert m.xml_prefix == None
assert m.xml_namespace == None
assert m.xml_name == (None, u'monty') #The "universal name" or "expanded name"
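To make the relationship between these name properties concrete, here is a minimal sketch in plain Python. It is not Amara's implementation: the helper name split_qname and the prefix-to-namespace dictionary are hypothetical, and it is written for Python 3 with only the standard library so you can run it without Amara installed.

```python
def split_qname(qname, namespaces):
    """Return (prefix, local, expanded_name) for a qualified name.

    `namespaces` maps prefixes to namespace names; the None key stands
    for the default namespace. Illustrative helper only, not an Amara API.
    """
    if ':' in qname:
        prefix, local = qname.split(':', 1)
    else:
        prefix, local = None, qname
    namespace = namespaces.get(prefix)
    return prefix, local, (namespace, local)


# No prefix and no default namespace, as with the monty element above
print(split_qname('monty', {}))
# A prefixed name resolved against a declared namespace
print(split_qname('x:monty', {'x': 'http://example.com/ns'}))
```

The third item in each result plays the role of xml_name: the namespace and local name together, with the prefix dropped because it is only a document-local abbreviation.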
A regular Python print tries to do the useful thing with each node type.
p1 = m.xml_children[0]
print p1
#<amara.tree.element at 0x5e68b0: name u'python', 0 namespaces, 1 attributes, 1 children>
print p1.xml_attributes[(None, u'spam')]
#eggs
Notice the difference between the treatment of elements and attributes. We're still trying to work out how best to combine consistency and usefulness, and it might change.
To serialize a node to XML use the xml_write or xml_encode method. The former writes to an output stream (stdout by default). The latter returns a string.
p1.xml_write()
#<python spam="eggs">What do you mean "bleh"</python>
You can manipulate the information content of XML nodes as well.
#Some manipulation
p1.xml_attributes[(None, u'spam')] = u"greeneggs"
p1.xml_children[0].xml_value = u"Close to the edit"
p1.xml_write()
Writing XML (and HTML) from nodes
As you saw above you can use the xml_write() method to re-serialize a node to XML on a stream (sys.stdout by default). Use the xml_encode() method to re-serialize to XML, returning a string. These work with entity as well as element nodes.
node.xml_write() #Write an XML document to stdout
node.xml_encode() #Return a UTF-8 XML string
By default xml_write writes to stdout using an xml writer, but you can change this default behaviour:
from amara.writers import lookup
HTML_W = lookup("html")
node.xml_write(writer=HTML_W, stream=my_file, encoding='iso-8859-1') #outputs html into my_file (an open file object), encoding it as iso-8859-1
The lookup function returns a writer class for strings such as "xml" and "html".
from amara.writers import lookup
XML_W = lookup("xml")
HTML_W = lookup("html")
node.xml_write(XML_W) #Write out an XML document
node.xml_encode(HTML_W) #Return an HTML string
The default writer is the XML writer (i.e. amara.writers.lookup("xml"))
The pretty-printing or indenting writers are also useful.
node.xml_write(lookup("xml-indent")) #Write to stdout a pretty-printed XML document
node.xml_encode(lookup("html-indent")) #Return a pretty-printed HTML string
Note: you can also use the lookup strings directly:
node.xml_write("xml") #Write out an XML document
node.xml_encode("html") #Return an HTML string
More on writing documents: Amara/Writing
Creating a document from scratch
You can use the various node classes as factories for creating entities/documents, and other nodes.
from amara import tree
doc = tree.entity()
doc.xml_append(tree.element(None, u'spam'))
doc.xml_write() #<?xml version="1.0" encoding="UTF-8"?>\n<spam/>
The XML bindery
Some of that xml_children[N] stuff is a bit awkward, and Amara includes a friendlier API called the XML bindery. It is like XML "data bindings" you might have heard of, but a more dynamic system that generates object attributes from the names and constructs in the XML document.
from amara import bindery
MONTY_XML = """<monty>
<python spam="eggs">What do you mean "bleh"</python>
<python ministry="abuse">But I was looking for argument</python>
</monty>"""
doc = bindery.parse(MONTY_XML)
m = doc.monty
p1 = doc.monty.python #or m.python; p1 is just the first python element
print
print p1.xml_attributes[(None, u'spam')]
print p1.spam
for p in doc.monty.python: #The loop will pick up both python elements
    p.xml_write()
Importantly, bindery nodes are subclasses of amara.tree nodes, so everything in the amara.tree section applies to amara.bindery nodes, including the methods for re-serializing to XML or HTML.
Amara bindery uses iterators to provide access to multiple child elements with the same name:
from amara import bindery
MONTY_XML = """<quotes>
<quote skit="1">This parrot is dead</quote>
<quote skit="2">What do you mean "bleh"</quote>
<quote skit="2">I don't like spam</quote>
<quote skit="3">But I was looking for argument</quote>
</quotes>"""
doc = bindery.parse(MONTY_XML)
q1 = doc.quotes.quote # or doc.quotes.quote[0]
print q1.skit
print q1.xml_attributes[(None, u'skit')] # XPath works too: q1.xml_select(u'@skit')
for q in doc.quotes.quote: # The loop will pick up all four quote elements
    print unicode(q) # Just the child char data
from itertools import groupby
from operator import attrgetter
skit_key = attrgetter('skit')
for skit, quotegroup in groupby(doc.quotes.quote, skit_key):
    print skit, [ unicode(q) for q in quotegroup ]
Creating a bindery document from scratch
You can also create a document from scratch, but the special nature of bindery specializes the process a bit. You can use the bindery entity base class:
from amara import bindery
doc = bindery.nodes.entity_base()
doc.xml_append(doc.xml_element_factory(None, u'spam'))
doc.xml_write() #<?xml version="1.0" encoding="UTF-8"?>\n<spam/>
You can also use xml_append_fragment to accelerate the process a bit:
from amara import bindery
doc = bindery.nodes.entity_base()
doc.xml_append_fragment('<a><b/></a>')
doc.xml_write() #<?xml version="1.0" encoding="UTF-8"?>\n<a><b/></a>
Using XPath
You can also use XPath for navigation. amara.tree fully supports XPath, which means Bindery and the other derived node systems do as well. Use the xml_select method on nodes.
from amara import bindery
MONTY_XML = """<monty>
<python spam="eggs">What do you mean "bleh"</python>
<python ministry="abuse">But I was looking for argument</python>
</monty>"""
doc = bindery.parse(MONTY_XML)
m = doc.monty
p1 = doc.monty.python
print p1.xml_select(u'string(@spam)')
for p in doc.xml_select(u'//python'):
    p.xml_write()
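If you want to experiment with these two queries on a machine without Amara, the following sketch does roughly the same thing with the standard library's xml.etree.ElementTree on Python 3. This is a comparison, not Amara code: ElementTree only supports a limited XPath subset (no string() function, for instance), so the attribute is read with get() instead.

```python
import xml.etree.ElementTree as ET

MONTY_XML = """<monty>
  <python spam="eggs">What do you mean "bleh"</python>
  <python ministry="abuse">But I was looking for argument</python>
</monty>"""

root = ET.fromstring(MONTY_XML)
# './/python' is the ElementTree spelling of the '//python' query
pythons = root.findall('.//python')
# ElementTree has no string(@spam); read the attribute directly instead
spam = pythons[0].get('spam')
print(spam)
for p in pythons:
    print(ET.tostring(p, encoding='unicode').strip())
```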
Parsing HTML
Amara can use html5lib to build a bindery tree from non-well-formed HTML, and even non-well-formed XML (though the latter is always an abomination). Just install html5lib first, e.g. using easy_install or pip.
from amara.bindery import html
H = '''<html>
<head>
<title>Amara</title>
<body>
<p class=DESC>XML processing toolkit
<p>Python meets<br> XML
</html>
'''
doc = html.parse(H)
#Use bindery operations
print unicode(doc.html.head.title)
#Use XPath
print doc.xml_select(u"string(/html/head/title)")
#Re-serialize (to well-formed output)
doc.xml_write()
The last line in effect tidies up the messy markup, producing something like XHTML, but without the namespace.
Generating XML (and HTML)
Amara supports the traditional, well-known, SAX-like approach to generating XML.
output.startElement()
output.text()
output.endElement()
But this is generally awkward and unfriendly (e.g. the code block structure does not reflect the XML output structure, so it can be really hard to debug when you trip up the order of output constructs), so in this tutorial, we'll focus on structwriter, a rather more natural approach. The "struct" in this case is a specialized data structure that translates readily to XML. For now just the one example, which does cover most of the key bits:
import sys, datetime
from amara.writers.struct import *
from amara.namespaces import *
tags = [u"xml", u"python", u"atom"]
w = structwriter(indent=u"yes")
w.feed(
    ROOT(
        E((ATOM_NAMESPACE, u'feed'), {(XML_NAMESPACE, u'xml:lang'): u'en'},
            E(u'id', u'urn:bogus:myfeed'),
            E(u'title', u'MyFeed'),
            E(u'updated', datetime.datetime.now().strftime('%Y-%m-%dT%H:%M:%SZ')),
            E(u'author',
                E(u'name', u'Uche Ogbuji'),
                E(u'uri', u'http://uche.ogbuji.net'),
                E(u'email', u'uche@ogbuji.net'),
            ),
            E(u'link', {u'href': u'/blog'}),
            E(u'link', {u'href': u'/blog/atom1.0', u'rel': u'self'}),
            E(u'entry',
                E(u'id', u'urn:bogus:myfeed:entry1'),
                E(u'title', u'Hello world'),
                E(u'updated', datetime.datetime.now().strftime('%Y-%m-%dT%H:%M:%SZ')),
                ( E(u'category', {u'term': t}) for t in tags ),
                E(u'content', {u'type': u'xhtml'},
                    E((XHTML_NAMESPACE, u'div'),
                        E(u'p', u'Happy to be here')
                    ))
            )
        )
    )
)
This generates an Atom feed, and Atom is a pretty good torture test for any XML generator library. The output:
<?xml version="1.0" encoding="UTF-8"?>
<feed xmlns="http://www.w3.org/2005/Atom" xml:lang="en">
  <id>urn:bogus:myfeed</id>
  <title>MyFeed</title>
  <updated>2008-09-12T15:09:16Z</updated>
  <author>
    <name>Uche Ogbuji</name>
    <uri>http://uche.ogbuji.net</uri>
    <email>uche@ogbuji.net</email>
  </author>
  <link href="/blog"/>
  <link rel="self" href="/blog/atom1.0"/>
  <entry>
    <id>urn:bogus:myfeed:entry1</id>
    <title>Hello world</title>
    <updated>2008-09-12T15:09:16Z</updated>
    <category term="xml"/>
    <category term="python"/>
    <category term="atom"/>
    <content type="xhtml">
      <div xmlns="http://www.w3.org/1999/xhtml">
        <p>Happy to be here</p>
      </div>
    </content>
  </entry>
</feed>
I hope/expect the code largely speaks for itself. A few interesting notes:
- Structwriter tries to help the lazy hand a bit. If you create an element with a namespace, any child element without a namespace will inherit the mapping. This is why I only had to declare the Atom namespace on the top feed element. All the children picked up the default namespace until it got to the div element, which redefined the default as XHTML, which was then passed on to its p child.
- This courtesy does not apply to attributes. If you don't declare a namespace for an attribute it will have none.
- You can create namespace declarations manually using the special NS(prefix, ns) construct. Just make sure it comes before any other type of child specified for that element. This is useful when you have QNames in content, e.g. generating XSLT or schema or SOAP or some other horror.
- Structwriter also tries to be smart with strings versus unicode. I still recommend using Unicode smartly when working with XML, but if you get lazy and just specify something as a string, Structwriter will just convert it for you.
- Notice the use of a generator expression (the category line inside the entry element) to generate the multiple category elements.
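The difference between elements and attributes with respect to the default namespace is a property of XML namespaces themselves, not just of Structwriter. The following sketch checks it by parsing a cut-down copy of the feed with the standard library's xml.etree.ElementTree on Python 3 (the FEED string is my own abbreviation of the output, and no Amara code is involved):

```python
import xml.etree.ElementTree as ET

FEED = """<feed xmlns="http://www.w3.org/2005/Atom" xml:lang="en">
  <link href="/blog"/>
</feed>"""

feed = ET.fromstring(FEED)
link = feed[0]
# The child element inherits the default (Atom) namespace ...
print(link.tag)
# ... but the unprefixed href attribute is in no namespace
print(list(link.attrib))
# The prefixed xml:lang attribute does carry a namespace
print(list(feed.attrib))
```

ElementTree spells a namespaced name as {namespace}local, so the link element shows up with the Atom namespace while href appears as a bare name.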
Generating XML (and HTML) gradually
The above works well if you are generating an XML document all in one go, but that's not always the case. Perhaps you are generating a huge document little by little. Perhaps you are generating a document in bits based on processing of asynchronous events. In such cases, you might find useful the coroutine (or pseudo-coroutine, if you insist) form of the structwriter. You set up an envelope of the XML structure, and a marker to which you can send inner elements as you prepare them. The following simple example shows the idea:
from amara.writers.struct import structwriter, E, NS, ROOT, RAW, E_CURSOR
class event_handler(object):
    def __init__(self, feed):
        self.feed = feed
    def execute(self, n):
        self.feed.send(E(u'event', unicode(n)))
output = structwriter(indent=u"yes")
feed = output.cofeed(ROOT(E(u'log', E_CURSOR(u'events', {u'type': u'numberfeed'}))))
h = event_handler(feed)
for n in xrange(10):
    h.execute(n)
feed.close()
Generates the following XML:
<?xml version="1.0" encoding="utf-8"?>
<log>
<events type="numberfeed">
<event>0</event>
<event>1</event>
<event>2</event>
<event>3</event>
<event>4</event>
<event>5</event>
<event>6</event>
<event>7</event>
<event>8</event>
<event>9</event>
</events>
</log>
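If the send/close protocol is unfamiliar, this stripped-down sketch shows the general pattern with a plain Python generator. It is not how structwriter is implemented: the cofeed function here is a hypothetical stand-in that writes raw strings, and it targets Python 3 with only the standard library. Note that a bare generator has to be advanced to its first yield before send() works, which I do explicitly with next().

```python
import io


def cofeed(stream, envelope_open, envelope_close):
    """Write the envelope start, then each fragment sent in, then the end."""
    stream.write(envelope_open)
    try:
        while True:
            fragment = yield
            stream.write(fragment)
    finally:
        # Runs when the caller invokes close() on the generator
        stream.write(envelope_close)


out = io.StringIO()
feed = cofeed(out, '<log><events>', '</events></log>')
next(feed)  # Advance to the first yield so send() is accepted
for n in range(3):
    feed.send('<event>%d</event>' % n)
feed.close()
print(out.getvalue())
```

The point of the design is that the producer of events never needs to know about the envelope; it only sends fragments, and closing the feed finishes the document.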
Modeling XML
XML is eminently flexible, but this flexibility can be a bit of a pain for developers. Amara is all about making XML less of a pain for developers, and in Amara 2.0 you have a powerful new tool. You can control the content model of parsed XML documents, and you can use such information to simplify things, with just a little up-front work. You can do this in several ways but I'll focus on the "modeling by example" approach.
Examplotron (see "Introducing Examplotron") is an XML schema language where an example document is basically your schema. The following listing is a regular XML document, and is also an Examplotron schema.
LABEL_MODEL = '''<?xml version="1.0" encoding="utf-8"?>
<labels>
<label>
<name>[Addressee name]</name>
<address>
<street>[Address street info]</street>
<city>[City]</city>
<state>[State abbreviation]</state>
</address>
</label>
</labels>
'''
It establishes a model that there is a labels element at the top, containing a label element child, and so on. In this case the intention is that there are multiple label element children and Examplotron allows you to clarify this point using an inline annotation:
LABEL_MODEL = '''<?xml version="1.0" encoding="utf-8"?>
<labels xmlns:eg="http://examplotron.org/0/">
<label eg:occurs="*">
<name>[Addressee name]</name>
<address>
<street>[Address street info]</street>
<city>[City]</city>
<state>[State abbreviation]</state>
</address>
</label>
</labels>
'''
Specifically, eg:occurs="*" indicates 0 or more occurrences.
The following is an XML document that conforms to the schema.
VALID_LABEL_XML = '''<?xml version="1.0" encoding="utf-8"?>
<labels>
<label>
<name>Thomas Eliot</name>
<address>
<street>3 Prufrock Lane</street>
<city>Stamford</city>
<state>CT</state>
</address>
</label>
</labels>
'''
The following is an XML document that does not conform to the schema.
INVALID_LABEL_XML = '''<?xml version="1.0" encoding="utf-8"?>
<labels>
<label>
<quote>What thou lovest well remains, the rest is dross</quote>
<name>Ezra Pound</name>
<address>
<street>45 Usura Place</street>
<city>Hailey</city>
<state>ID</state>
</address>
</label>
</labels>
'''
The quote element is not in the model.
You can specify the XML model to use when parsing to Bindery.
from amara import bindery
from amara.bindery.model import *
label_model = examplotron_model(LABEL_MODEL)
doc = bindery.parse(VALID_LABEL_XML, model=label_model)
doc.xml_validate()
doc = bindery.parse(INVALID_LABEL_XML, model=label_model)
try:
    doc.xml_validate()
except bindery.BinderyError, e:
    print e
doc.xml_write()
You can parse INVALID_LABEL_XML, but the xml_validate() method fails and raises an exception because of the unexpected quote element. Note: you can validate just an element's subtree rather than the entire document. You can also validate documents after mutating them with the Amara API. Validation can be a bit expensive (though not noticeably unless you're dealing with huge docs), so use it judiciously. You only pay that penalty upon actual validation. Mutation, document access and other operations proceed at regular speed.
If you have a somewhat irregular XML document, it can be tricky to use bindery object traversal (e.g. doc.labels.label) without risking AttributeError. If you use a model in parsing a document, this model makes the binding smarter, and you can set a default value to be returned in cases where a known element happens to be missing somewhere in your instance document.
LABEL_MODEL = '''<?xml version="1.0" encoding="utf-8"?>
<labels>
<label>
<quote>What thou lovest well remains, the rest is dross</quote>
<name>Ezra Pound</name>
<address>
<street>45 Usura Place</street>
<city>Hailey</city>
<state>ID</state>
</address>
</label>
</labels>
'''
TEST_LABEL_XML = '''<?xml version="1.0" encoding="utf-8"?>
<labels>
<label>
<name>Thomas Eliot</name>
<address>
<street>3 Prufrock Lane</street>
<city>Stamford</city>
<state>CT</state>
</address>
</label>
</labels>
'''
from amara import bindery
from amara.bindery.model import *
label_model = examplotron_model(LABEL_MODEL)
doc = bindery.parse(TEST_LABEL_XML, model=label_model)
print doc.labels.label.quote #None, rather than raising AttributeError
So even though the instance document doesn't have a quote element, Amara knows from the model that this is an optional element. If you try to access the quote element you get back the default value of None. You can of course override this default if you like.
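As a rough mental model of what the model buys you here, consider this toy class. It is purely illustrative and is not Amara's mechanism: ModeledNode and its known_names argument are hypothetical, and the code is plain Python 3. Names the "model" knows about fall back to None when absent, while anything else still raises AttributeError.

```python
class ModeledNode(object):
    """Toy node: names known to the model default to None when absent."""

    def __init__(self, known_names, **children):
        self._known_names = set(known_names)
        self.__dict__.update(children)

    def __getattr__(self, name):
        # Only called when normal attribute lookup fails
        if name in self._known_names:
            return None
        raise AttributeError(name)


label = ModeledNode(['quote', 'name', 'address'], name='Thomas Eliot')
print(label.name)   # Present in the instance
print(label.quote)  # Known to the model but absent, so the default is used
```

Asking this toy for something outside the model, such as label.publisher, still fails loudly, which is usually what you want for genuine typos.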
Extracting metadata from models
You can also declare inline in the Examplotron model parts of the document that you find particularly interesting. Amara gives you a mechanism to extract those interesting bits as an iterator of simple tuples, so you can in effect skip the XML "API" altogether. In the following example the metadata extraction annotations are in the namespace given the ak prefix.
from amara.xpath import datatypes
from amara.bindery.model import examplotron_model, generate_metadata
from amara import bindery
from amara.lib import U
MODEL_A = '''<labels
xmlns:eg="http://examplotron.org/0/"
xmlns:ak="http://purl.org/xml3k/akara/xmlmodel">
<label id="tse" added="2003-06-10" eg:occurs="*" ak:resource="@id">
<!-- use ak:resource="" for an anonymous resource -->
<quote eg:occurs="?">
<emph>Midwinter</emph> Spring is its own <strong>season</strong>...
</quote>
<name ak:rel="name()">Thomas Eliot</name>
<address ak:rel="'place'" ak:value="concat(city, ',', province)">
<street>3 Prufrock Lane</street>
<city>Stamford</city>
<province>CT</province>
</address>
<opus year="1932" ak:rel="name()" ak:resource="">
<title ak:rel="name()">The Wasteland</title>
</opus>
<tag eg:occurs="*" ak:rel="name()">old possum</tag>
</label>
</labels>
'''
labelmodel = examplotron_model(MODEL_A)
INSTANCE_A_1 = '''<labels>
<label id="co" added="2004-11-15">
<name>Christopher Okigbo</name>
<address>
<street>7 Heaven's Gate</street>
<city>Idoto</city>
<province>Anambra</province>
</address>
<opus>
<title>Heaven's Gate</title>
</opus>
<tag>biafra</tag>
<tag>poet</tag>
</label>
</labels>
'''
doc = bindery.parse(INSTANCE_A_1, model=labelmodel)
for triple in generate_metadata(doc): #Triples, but only RDF if you want it to be
    print (triple[0], triple[1], U(triple[2]))
The output is:
(u'co', u'name', u'Christopher Okigbo')
(u'co', u'place', u'Idoto,Anambra')
(u'co', u'opus', u'r2e0e1e5')
(u'r2e0e1e5', u'title', u"Heaven's Gate")
(u'co', u'tag', u'biafra')
(u'co', u'tag', u'poet')
Each triple is (current-resource-id, relationship-string, result-xpath-expression). I use the U convenience function, which takes an object and figures out a way to get you back a Unicode object.
You can apply Python's iterator goodness to organize this data in any convenient way, for example:
from itertools import groupby
from operator import itemgetter
from amara.lib import U
for rid, triples in groupby(generate_metadata(doc), itemgetter(0)):
    print 'Resource:', rid
    for row in triples:
        print '\t', (row[0], row[1], U(row[2]))
The output is:
Resource: co
    (u'co', u'name', u'Christopher Okigbo')
    (u'co', u'place', u'Idoto,Anambra')
    (u'co', u'opus', u'r2e0e1e5')
Resource: r2e0e1e5
    (u'r2e0e1e5', u'title', u"Heaven's Gate")
Resource: co
    (u'co', u'tag', u'biafra')
    (u'co', u'tag', u'poet')
There's a full listing of this section's code at generate_metadata_demo.py
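You may have noticed that "Resource: co" shows up twice in that output. That is because itertools.groupby only merges consecutive items with the same key, and the opus triple interrupts the run. If you would rather have one group per resource, sort by the key first. The sketch below demonstrates this with a hard-coded copy of the triples (plain strings standing in for what generate_metadata yields), written for Python 3 so it runs without Amara:

```python
from itertools import groupby
from operator import itemgetter

triples = [
    ('co', 'name', 'Christopher Okigbo'),
    ('co', 'place', 'Idoto,Anambra'),
    ('co', 'opus', 'r2e0e1e5'),
    ('r2e0e1e5', 'title', "Heaven's Gate"),
    ('co', 'tag', 'biafra'),
    ('co', 'tag', 'poet'),
]

# sorted() is stable, so rows keep their document order within each resource
by_resource = sorted(triples, key=itemgetter(0))
grouped = {rid: list(rows) for rid, rows in groupby(by_resource, itemgetter(0))}
for rid, rows in grouped.items():
    print('Resource:', rid)
    for row in rows:
        print('\t', row)
```

Sorting does mean holding all the triples in memory at once, so for very large documents you may prefer to live with the repeated groups.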
Incremental parsing
Imagine you have a 10MB XML file with a very long sequence of small records. If you try to use a convenient tree API you will end up loading into memory a structure several times the size of the full XML document, even though very often when processing such files you only care about one record at a time. You could switch to SAX, but then you lose the convenience of the tree API.
Amara provides a system for incremental parsing, which yields subtrees according to a declared pattern. It is provided as the pushtree function in amara.pushtree, and requires you to set up a callback function, which is sent the subtrees as they are ready. That's not nearly as complex as it might sound, and the following examples should get you going quickly.
You pass pushtree the full XML source and a pattern for the subtrees, as in the following example. The patterns are a subset of XPath.
from amara.pushtree import pushtree
from amara.lib import U
def receive_nodes(node):
    print U(node) #Will print 0 then 1 then 10 then 11 with input below
    return
XML="""<doc>
<one><a>0</a><a>1</a></one>
<two><a>10</a><a>11</a></two>
</doc>
"""
pushtree(XML, u'a', receive_nodes)
Which should put out:
0
1
10
11
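For readers who want to see the general idea of "hand each matching subtree to a callback, then let it go" without Amara installed, here is a comparable sketch using the standard library's xml.etree.ElementTree.iterparse on Python 3. It is an analogy, not how pushtree is implemented: push_elements is a hypothetical helper, and it matches on a plain tag name rather than an XPath-like pattern.

```python
import io
import xml.etree.ElementTree as ET

XML = b"""<doc>
  <one><a>0</a><a>1</a></one>
  <two><a>10</a><a>11</a></two>
</doc>
"""


def push_elements(source, tag, callback):
    """Call callback(elem) for each completed element named tag."""
    for event, elem in ET.iterparse(source, events=('end',)):
        if elem.tag == tag:
            callback(elem)
            elem.clear()  # Drop the subtree's content once handled


seen = []
push_elements(io.BytesIO(XML), 'a', lambda elem: seen.append(elem.text))
print(seen)
```

As with pushtree, the callback sees one small subtree at a time, so memory use depends on the record size rather than the document size.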
You can also specialize the nodes sent to the callback. The most common use for this feature is to deal with more friendly Bindery nodes rather than raw tree nodes.
from amara.pushtree import pushtree
from amara.lib import U
from amara.bindery.nodes import entity_base
def receive_nodes(node):
    print U(node.b) #Will print 0 then 1 then 10 then 11 with input below
    return
XML="""<doc>
<one><a b='0'/><a b='1'/></one>
<two><a b='10'/><a b='11'/></two>
</doc>
"""
pushtree(XML, u'a', receive_nodes, entity_factory=entity_base)
Which should put out the same as the earlier example.
You can use a coroutine if you need easier state management in the push target.
from amara.pushtree import pushtree
from amara.lib.util import coroutine
@coroutine
def receive_nodes(text_list):
    while True:
        node = yield
        text_list.append(node.xml_encode())
    return
XML="""<doc>
<one><a>0</a><a>1</a></one>
<two><a>10</a><a>11</a></two>
</doc>
"""
text_list = []
target = receive_nodes(text_list)
pushtree(XML, u'a', target.send)
target.close()
print text_list
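I have not shown what the coroutine decorator does. I am assuming here that amara.lib.util.coroutine follows the common "priming" idiom, in which the generator is created and advanced to its first yield so that callers can use send() right away. The sketch below defines a local stand-in with the same name (not Amara's code) on Python 3, and feeds it strings in place of nodes:

```python
import functools


def coroutine(func):
    """Create the generator and advance it to its first yield."""
    @functools.wraps(func)
    def start(*args, **kwargs):
        gen = func(*args, **kwargs)
        next(gen)
        return gen
    return start


@coroutine
def collect(items):
    while True:
        item = yield
        items.append(item)


collected = []
target = collect(collected)
target.send('<a>0</a>')  # Works immediately because the decorator primed it
target.send('<a>1</a>')
target.close()
print(collected)
```

Without the priming step, the first send() of a non-None value to a fresh generator raises TypeError, which is why the decorator is handy for push targets.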
XSLT
You can run XSLT transforms using the amara.xslt.transform function.
from amara.xslt import transform
result = transform(source, transforms)
source and transforms can be strings, file objects, file names, URIs or input source objects. In the case of transforms it can either be an individual one of these, or a list thereof. In the latter case, each subsequent transform is treated as an out-of-band XSLT import into its predecessor.
You can also apply XSLT to a parsed node by passing the transform to its xml_transform method.
result = node.xml_transform(transforms)
Both these functions also take a dictionary for top-level XSLT parameters.
result = transform(source, transforms, params={u'title': u'A cool title'})
The result is an instance of one of the subclasses of amara.xslt.result, specifically stringresult, streamresult, and treeresult. Key properties of these are:
- `result.stream`: stream buffer of the processor (not available on stringresult and treeresult instances)
- `result.method`: xsl:output method parameter
- `result.encoding`: xsl:output encoding parameter
- `result.media_type`: xsl:output media-type parameter
- `result.parameters`: all other parameters set during transform execution
There is a new function amara.xpath.util.parameterize, which takes a dictionary and turns it into a set of parameters suitable for passing into an XSLT transform. It's basically a convenience function to make it fairly easy to pass Python data into transforms.
from amara import parse
from amara.xpath.util import parameterize
doc = parse('<monty spam="1"><python/></monty>')
e1 = doc.xml_first_child
e2 = e1.xml_first_child #doc.xml_select(u'//python')
a1 = e1.xml_attributes[None, u'spam'] #e1.xml_select(u'@spam')
D = {'p1': 1, 'p2': e1, 'p3': e2, 'p4': a1}
print parameterize(D)
#Result is something like
#{(None, u'p3'): <amara.tree.element at 0x10179a140: name u'python', 0 namespaces, 0 attributes, 0 children>, (None, u'p4'): String(u'1'), (None, u'p1'): Number(1), (None, u'p2'): <amara.tree.element at 0x10179a0c8: name u'monty', 0 namespaces, 1 attributes, 1 children>}
See also
Amara/Seven_days/3 - Some more examples using Bindery with Python iterators
