XML Prague paper
Contents
- Abstract
- Introduction
- The basic tree APIs
- Generating XML (and HTML)
- Modeling XML
- Incremental parsing
- And much more
- Akara Web Framework
- Introducing WSGI, and working with URL path hierarchy
- Error handling, and making things more robust
- Handling HTTP POST
- Conclusion
- Appendix A: More background on 4Suite
Abstract
Akara is a platform for developing data services, and especially XML data services, available on the Web, using REST architecture. It is open source software (Apache 2 licensed) written in Python and C. An important concept in Akara is information pipelining, where discrete services can be combined and chained together, including services hosted remotely. There is strong support for pipeline stages for XML processing, as Akara includes a port of the well-known 4Suite and Amara XML processing components for Python. The version of Amara in Akara provides optimized XML processing using common XML standards as well as fresh ideas for expressing XML pattern processing, based on long experience in standards-based XML applications. Some of these features include XPath and XSLT, a lightweight, dynamic data binding mechanism, XML modeling and processing constraints by example (using Examplotron), Schematron assertions, XPath-driven streamable processing and overall low-level support for lazy iterator processing, and thus the map/reduce style. Akara does not enforce a built-in mechanism for persistence of XML, but is designed to complete a low-level persistence engine with overall characteristics of an XML DBMS.
Akara, despite its deliberately low profile to date, has played a crucial role in several marquee projects, including The Library of Congress's Recollection project and The Reference Extract project, a collaboration of The MacArthur Foundation, OCLC, and Zepheira. In Recollection Akara runs the data pipeline for user views, and is used to process XML MODS files with catalog records. In RefExtract Akara processes information about topics and related Web pages to provide measures of page credibility. Other users include Cleveland Clinic, Elsevier and Sun Microsystems.
This paper introduces Akara in general, but focuses on the innovative methods for XML processing, in stand-alone code and wrapped as RESTful services.
Introduction
Akara's developers have been involved in XML, and especially in XML processing with Python, since the very beginning. We've seen it all, and pretty much implemented it all. At first the motivation was that XML seemed the best hope for semi-structured database technology, but by now XML has become "just plumbing," as used in countless domains, including for many unsuited uses. There are many XML processing libraries in Python, and even the standard library finally has a respectable one with ElementTree.
So why a new XML pipeline and processing project, especially one as ambitious as Akara? The first answer is that it's not just about XML, but even focusing on the XML processing kit, the fact is most XML processing tools, not just in Python, but in general, are entirely focused on the dumb plumbing. These treat XML as a temporary inconvenience, rather than as a strategic technology. This is often justified, because most uses of XML by far are products of poor judgment, where other technologies would have been far more suited. But for those cases where XML is well suited, briefly characterized as where traditional, granular data combines with rich, prosaic expression, the current crop of tools is inadequate.
Akara's developers want to be able to treat with XML above the level of plumbing, to deal with it at the level of expression. Used correctly XML is not an inconvenience, and bears fruit when handled as richly and naturally as possible, because the data in XML is likely to outlive any particular code base a long, long time. At the same time one desires tools that make it easy to connect stuff suited to XML with stuff that's best suited to other formats. The ideal architecture supports pipelining XML processing with HTML, JSON, RDBMS and all that, without too much coupling to code.
Such requirements add up to tools that encourage working with XML in as declarative a manner as possible, operating at the level of data patterns, pattern dispatch and data modeling. It should be very natural to overlay semantic technology over XML whether in the form of RDF or in other formats at the higher level of semantic annotations. It's also important to start by getting the details right, such as proper handling of mixed content, and to keep perspective with powerful, generic modeling techniques such as Schematron Abstract Patterns.
Over a decade of XML processing has demonstrated the difficulty of pleasing the desperate Perl/Python/Ruby hacker without shredding the rich information expression benefits of XML. The upshot is the present interest in a “refactoring” of the XML stack, and especially accommodation of XML to a world in which JSON is firmly established on grounds previously assumed for XML. Akara aims at détente, applying traditional XML standards as much as reasonable, but judiciously deferring to more natural Python idioms where needed to avoid frustrating developers.
Akara provides these benefits, but with a strong preference for Web-based integration. The umbrella project is a Web framework, of which a key component is Amara 2, a port of the well-known 4Suite and Amara XML processing components for Python. Amara 2 provides optimized XML processing using common XML standards as well as fresh ideas for expressing XML pattern processing, based on long experience in standards-based XML applications. Some of these features include:
- XPath
- XSLT
- lightweight, dynamic data binding
- XML modeling and processing constraints by example (using Examplotron)
- Schematron assertions
- XPath-driven streamable processing
- low-level support for lazy iterator processing, and thus the map/reduce style
Amara 2.x is designed from the ground up for the architectural benefits discussed above. It treats data as much as possible in the data domain, rather than in the code domain. In practice one still needs good code interfaces, but the key balance to strike is in the nature of the resulting code. The basic planks of the design are:
- syncretism - combining the practical expressiveness of Python with the declarative purity of XML (it is very difficult to balance such divergent technologies)
- less code - supporting compact code, so there's less to maintain
- grace - making it easy to do the right thing in terms of standards and practices (encouraging sound design and modeling, using “less code” as an incentive)
- declarative transparency - structuring usage for easy translation from one system of declarations to others, and reusing standard declaration systems (such as XSLT patterns) where appropriate
The result is an XML processing library that's truly different from anything else out there. Whether it suits one's tastes or not is a matter of taste.
The Web server system, Akara proper, is designed to be layered upon Amara 2.x in a RESTful context, as a lightweight vehicle for deploying data transforms as services.
The basic tree APIs
Amara 2.0 comes with several tree APIs, and makes it fairly easy to design custom tree APIs by extension.
Parsing XML into simple trees
The most fundamental tree API is just called amara.tree. It's very simple and highly optimized, but it lacks some of the features of the Bindery API, which is recommended unless you really need to wring out every ounce of performance.
import amara
from amara import tree
MONTY_XML = """<monty>
<python spam="eggs">What do you mean "bleh"</python>
<python ministry="abuse">But I was looking for argument</python>
</monty>"""
doc = amara.parse(MONTY_XML)
doc is an amara.tree.entity node, the root of nodes representing the elements, attributes, text, etc. in the document.
assert doc.xml_type == tree.entity.xml_type
doc.xml_children is a sequence of the child nodes of the entity, including the top element.
m = doc.xml_children[0]
You might be wondering about the common "xml_" prefix for these methods. The higher-level Bindery (data binding) API builds on amara.tree. It constructs object attribute names from names in the XML document. In XML, names starting with "xml_" are reserved so this Amara convention helps avoid name clashes.
You can navigate from a node to its parent.
assert m.xml_parent == doc
Access all the components of the node's name, including namespace information.
assert m.xml_local == u'monty' #local name, i.e. without any prefix
assert m.xml_qname == u'monty' #qualified name, e.g. includes prefix
assert m.xml_prefix == None
assert m.xml_namespace == None
assert m.xml_name == (None, u'monty') #The "universal name" or "expanded name"
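For readers who know the standard library better than Amara, the "universal name" idea can be illustrated without Amara at all. The following is a minimal sketch in Python 3 using xml.etree.ElementTree, which stores names in Clark notation ({namespace}local). The helper universal_name and the namespace urn:example:flying-circus are hypothetical, chosen only for illustration, and are not part of Amara.

```python
import xml.etree.ElementTree as ET

def universal_name(tag):
    """Hypothetical helper: split a Clark-notation tag into (namespace, local)."""
    if tag.startswith('{'):
        namespace, local = tag[1:].split('}', 1)
        return (namespace, local)
    # No namespace: mirror the (None, local) convention shown above
    return (None, tag)

plain = ET.fromstring('<monty/>')
spaced = ET.fromstring('<m:monty xmlns:m="urn:example:flying-circus"/>')
print(universal_name(plain.tag))
print(universal_name(spaced.tag))
```

Note that the prefix (m in the second document) plays no part in the universal name, which is the point of comparing names this way.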
A regular Python print tries to do the useful thing with each node type.
p1 = m.xml_children[0]
print p1
#<amara.tree.element at 0x5e68b0: name u'python', 0 namespaces, 1 attributes, 1 children>
print p1.xml_attributes[(None, u'spam')]
#eggs
Notice the difference between the treatment of elements and attributes.
To serialize a node to XML use the xml_write or xml_encode method. The former writes to an output stream (stdout by default). The latter returns a string.
p1.xml_write()
#<python spam="eggs">What do you mean "bleh"</python>
You can manipulate the information content of XML nodes as well.
#Some manipulation
p1.xml_attributes[(None, u'spam')] = u"greeneggs"
p1.xml_children[0].xml_value = u"Close to the edit"
p1.xml_write()
Writing XML (and HTML) from nodes
As demonstrated above, the xml_write() method can be used to re-serialize a node to XML to a stream (sys.stdout by default). Use the xml_encode() method to re-serialize to XML, returning a string. These work with entity as well as element nodes.
node.xml_write() #Write an XML document to stdout
node.xml_encode() #Return a UTF-8 XML string
There are special methods to look up a writer class from strings such as "xml" and "html"
from amara.writers import lookup
XML_W = lookup("xml")
HTML_W = lookup("html")
node.xml_write(XML_W) #Write out an XML document
node.xml_encode(HTML_W) #Return an HTML string
The default writer is the XML writer (i.e. amara.writers.lookup("xml"))
The pretty-printing or indenting writers are also useful.
node.xml_write(lookup("xml-indent")) #Write to stdout a pretty-printed XML document
node.xml_encode(lookup("html-indent")) #Return a pretty-printed HTML string
Note: you can also use the lookup strings directly:
node.xml_write("xml") #Write out an XML document
node.xml_encode("html") #Return an HTML string
Creating a document from scratch
The various node classes can be used as factories for creating entities/documents, and other nodes.
from amara import tree
doc = tree.entity()
doc.xml_append(tree.element(None, u'spam'))
doc.xml_write() #<?xml version="1.0" encoding="UTF-8"?>\n<spam/>
The XML bindery
Some of that xml_children[N] stuff is a bit awkward, and Amara includes a friendlier API called the XML bindery. It is like XML "data bindings" you might have heard of, but a more dynamic system that generates object attributes from the names and constructs in the XML document.
from amara import bindery
MONTY_XML = """<monty>
<python spam="eggs">What do you mean "bleh"</python>
<python ministry="abuse">But I was looking for argument</python>
</monty>"""
doc = bindery.parse(MONTY_XML)
m = doc.monty
p1 = doc.monty.python #or m.python; p1 is just the first python element
print
print p1.xml_attributes[(None, u'spam')]
print p1.spam
for p in doc.monty.python: #The loop will pick up both python elements
    p.xml_write()
Importantly, bindery nodes are subclasses of amara.tree nodes, so everything in the amara.tree section applies to amara.bindery nodes, including the methods for re-serializing to XML or HTML.
Amara bindery uses iterators to provide access to multiple child elements with the same name:
from amara import bindery
MONTY_XML = """<quotes>
<quote skit="1">This parrot is dead</quote>
<quote skit="2">What do you mean "bleh"</quote>
<quote skit="2">I don't like spam</quote>
<quote skit="3">But I was looking for argument</quote>
</quotes>"""
doc = bindery.parse(MONTY_XML)
q1 = doc.quotes.quote # or doc.quotes.quote[0]
print q1.skit
print q1.xml_attributes[(None, u'skit')] # XPath works too: q1.xml_select(u'@skit')
for q in doc.quotes.quote: # The loop will pick up all four quote elements
    print unicode(q) # Just the child char data
from itertools import groupby
from operator import attrgetter
skit_key = attrgetter('skit')
for skit, quotegroup in groupby(doc.quotes.quote, skit_key):
    print skit, [ unicode(q) for q in quotegroup ]
Creating a bindery document from scratch
When creating a document from scratch, the special nature of bindery changes the process a bit, involving the bindery entity base class:
from amara import bindery
doc = bindery.nodes.entity_base()
doc.xml_append(doc.xml_element_factory(None, u'spam'))
doc.xml_write() #<?xml version="1.0" encoding="UTF-8"?>\n<spam/>
The xml_append_fragment method is useful for accelerating the process a bit:
from amara import bindery
doc = bindery.nodes.entity_base()
doc.xml_append_fragment('<a><b/></a>')
doc.xml_write() #<?xml version="1.0" encoding="UTF-8"?>\n<a><b/></a>
Using XPath
XPath is also available for navigation. amara.tree fully supports XPath, which means that Bindery and all other node systems derived from it do as well. Use the xml_select method on nodes.
from amara import bindery
MONTY_XML = """<monty>
<python spam="eggs">What do you mean "bleh"</python>
<python ministry="abuse">But I was looking for argument</python>
</monty>"""
doc = bindery.parse(MONTY_XML)
m = doc.monty
p1 = doc.monty.python
print p1.xml_select(u'string(@spam)')
for p in doc.xml_select(u'//python'):
    p.xml_write()
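For contrast, here is a rough sketch of similar queries using only Python 3's standard xml.etree.ElementTree, which implements a limited subset of XPath rather than the full language described above. It is offered as a point of comparison under that assumption, not as a description of Amara's xml_select.

```python
import xml.etree.ElementTree as ET

MONTY_XML = """<monty>
  <python spam="eggs">What do you mean "bleh"</python>
  <python ministry="abuse">But I was looking for argument</python>
</monty>"""

root = ET.fromstring(MONTY_XML)
# ElementTree has no XPath functions such as string() or count(),
# but descendant search and attribute predicates are available.
all_pythons = root.findall('.//python')
with_spam = root.findall('.//python[@spam]')
spam_value = with_spam[0].get('spam')
print(len(all_pythons), spam_value)
```

Expressions such as string(@spam) have to be replaced by ordinary Python calls here, which is the sort of plumbing that full XPath support avoids.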
Parsing HTML
Amara integrates html5lib for building a bindery from non-well-formed HTML, and even non-well-formed XML (though the latter is always an abomination).
from amara.bindery import html
H = '''<html>
<head>
<title>Amara</title>
<body>
<p class=DESC>XML processing toolkit
<p>Python meets<br> XML
</html>
'''
doc = html.parse(H)
#Use bindery operations
print unicode(doc.html.head.title)
#Use XPath
print doc.xml_select(u"string(/html/head/title)")
#Re-serialize (to well-formed output)
doc.xml_write()
The last line in effect tidies up the messy markup, producing something like XHTML, but without the namespace.
Generating XML (and HTML)
Amara supports the traditional, well-known, SAX-like approach to generating XML.
output.startElement()
output.text()
output.endElement()
But this is generally awkward and unfriendly (e.g. the code block structure does not reflect the XML output structure, so it can be really hard to debug when you trip up the order of output constructs), so in this tutorial, we'll focus on structwriter, a rather more natural approach. The "struct" in this case is a specialized data structure that translates readily to XML. For now just the one example, which does cover most of the key bits:
import sys, datetime
from amara.writers.struct import *
from amara.namespaces import *
tags = [u"xml", u"python", u"atom"]
w = structwriter(indent=u"yes")
w.feed(
ROOT(
    E((ATOM_NAMESPACE, u'feed'), {(XML_NAMESPACE, u'xml:lang'): u'en'},
        E(u'id', u'urn:bogus:myfeed'),
        E(u'title', u'MyFeed'),
        E(u'updated', datetime.datetime.now().strftime('%Y-%m-%dT%H:%M:%SZ')),
        E(u'author',
            E(u'name', u'Uche Ogbuji'),
            E(u'uri', u'http://uche.ogbuji.net'),
            E(u'email', u'uche@ogbuji.net'),
        ),
        E(u'link', {u'href': u'/blog'}),
        E(u'link', {u'href': u'/blog/atom1.0', u'rel': u'self'}),
        E(u'entry',
            E(u'id', u'urn:bogus:myfeed:entry1'),
            E(u'title', u'Hello world'),
            E(u'updated', datetime.datetime.now().strftime('%Y-%m-%dT%H:%M:%SZ')),
            ( E(u'category', {u'term': t}) for t in tags ),
            E(u'content', {u'type': u'xhtml'},
                E((XHTML_NAMESPACE, u'div'),
                    E(u'p', u'Happy to be here')
                ))
        )
    )
)
)
This generates an Atom feed, and Atom is a pretty good torture test for any XML generator library. The output:
<?xml version="1.0" encoding="UTF-8"?>
<feed xmlns="http://www.w3.org/2005/Atom" xml:lang="en">
  <id>urn:bogus:myfeed</id>
  <title>MyFeed</title>
  <updated>2008-09-12T15:09:16Z</updated>
  <author>
    <name>Uche Ogbuji</name>
    <uri>http://uche.ogbuji.net</uri>
    <email>uche@ogbuji.net</email>
  </author>
  <link href="/blog"/>
  <link rel="self" href="/blog/atom1.0"/>
  <entry>
    <id>urn:bogus:myfeed:entry1</id>
    <title>Hello world</title>
    <updated>2008-09-12T15:09:16Z</updated>
    <category term="xml"/>
    <category term="python"/>
    <category term="atom"/>
    <content type="xhtml">
      <div xmlns="http://www.w3.org/1999/xhtml">
        <p>Happy to be here</p>
      </div>
    </content>
  </entry>
</feed>
A few interesting points:
- Structwriter tries to help the lazy hand a bit. If you create an element with a namespace, any child element without a namespace will inherit the mapping. This is why I only had to declare the Atom namespace on the top feed element. All the children picked up the default namespace until it got to the div element, which redefined the default as XHTML, which was then passed on to its p child.
- You can create namespace declarations manually using the special NS(prefix, ns) construct. Just make sure it comes before any other type of child specified for that element. This is useful when you have QNames in content, e.g. generating XSLT or schema or SOAP or some other horror.
- The namespace inheritance courtesy does not apply to attributes. If you don't declare a namespace for an attribute it will have none.
- Structwriter also tries to be smart with strings versus unicode. I still recommend using Unicode smartly when working with XML, but if you get lazy and just specify something as a string, Structwriter will just convert it for you.
- Notice the use of a generator expression to generate the multiple category elements.
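The nesting idea behind the "struct" approach can be imitated in a few lines, which may help show why the code structure mirrors the output structure. The following is only a sketch in Python 3 on top of the standard xml.etree.ElementTree. The helper E here is hypothetical, ignores namespaces entirely, and is not Amara's structwriter.

```python
import xml.etree.ElementTree as ET

def E(name, *content):
    """Hypothetical helper: build an element from attribute dicts, text,
    child elements, and iterables (such as generators) of child elements."""
    elem = ET.Element(name)
    for item in content:
        if isinstance(item, dict):
            elem.attrib.update(item)
        elif isinstance(item, str):
            # Simplification: text is only supported before any child element
            elem.text = (elem.text or '') + item
        elif isinstance(item, ET.Element):
            elem.append(item)
        else:
            # Generators and other iterables are expanded in place
            for child in item:
                elem.append(child)
    return elem

tags = ['xml', 'python', 'atom']
entry = E('entry',
    E('title', 'Hello world'),
    (E('category', {'term': t}) for t in tags),
)
xml_text = ET.tostring(entry, encoding='unicode')
print(xml_text)
```

Because the generator expression is consumed where it appears, the three category elements land exactly where they sit in the nested call.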
Generating XML (and HTML) gradually
The above works well if you are generating an XML document all in one go, but that's not always the case. Perhaps you are generating a huge document little by little. Perhaps you are generating a document in bits based on processing of asynchronous events. In such cases, you might find the coroutine (or pseudo-coroutine, if you insist) form of the structwriter useful. You set up an envelope of the XML structure, and a marker to which you can send inner elements as you prepare them. The following simple example
from amara.writers.struct import structwriter, E, NS, ROOT, RAW, E_CURSOR
class event_handler(object):
    def __init__(self, feed):
        self.feed = feed
    def execute(self, n):
        self.feed.send(E(u'event', unicode(n)))
output = structwriter(indent=u"yes")
feed = output.cofeed(ROOT(E(u'log', E_CURSOR(u'events', {u'type': u'numberfeed'}))))
h = event_handler(feed)
for n in xrange(10):
    h.execute(n)
feed.close()
Generates the following XML:
<?xml version="1.0" encoding="utf-8"?>
<log>
  <events type="numberfeed">
    <event>0</event>
    <event>1</event>
    <event>2</event>
    <event>3</event>
    <event>4</event>
    <event>5</event>
    <event>6</event>
    <event>7</event>
    <event>8</event>
    <event>9</event>
  </events>
</log>
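The envelope-and-cursor mechanism can be approximated with a plain Python generator. What follows is a minimal sketch in Python 3, assuming nothing about how Amara implements cofeed. The function cofeed_sketch is hypothetical, and it writes raw strings without any escaping or well-formedness checks.

```python
import io

def cofeed_sketch(stream, envelope_open, envelope_close):
    """Hypothetical generator-based writer: emit the start of the envelope,
    then each fragment sent in, then the end of the envelope on close()."""
    stream.write(envelope_open)
    try:
        while True:
            fragment = yield
            stream.write(fragment)
    finally:
        # Runs when the caller invokes close(), which raises GeneratorExit here
        stream.write(envelope_close)

out = io.StringIO()
feed = cofeed_sketch(out, '<log><events type="numberfeed">', '</events></log>')
next(feed)  # advance to the first yield so that send() can be used
for n in range(3):
    feed.send('<event>%d</event>' % n)
feed.close()
result = out.getvalue()
print(result)
```

The try/finally is the design choice worth noting: it keeps the closing tags next to the opening ones in the code, even though they are written at very different times.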
Modeling XML
XML is eminently flexible, but this flexibility can be a bit of a pain for developers. Amara is all about making XML less of a pain for developers, and in Amara 2.0 you have a powerful new tool. You can control the content model of parsed XML documents, and you can use such information to simplify things, with just a little up-front work. You can do this in several ways but I'll focus on the "modeling by example" approach.
Examplotron (see "Introducing Examplotron") is an XML schema language where an example document is basically your schema. The following listing is a regular XML document, and is also an Examplotron schema.
LABEL_MODEL = '''<?xml version="1.0" encoding="utf-8"?>
<labels>
<label>
<name>[Addressee name]</name>
<address>
<street>[Address street info]</street>
<city>[City]</city>
<state>[State abbreviation]</state>
</address>
</label>
</labels>
'''
It establishes a model that there is a labels element at the top, containing a label element child, and so on. In this case the intention is that there are multiple label element children and Examplotron allows you to clarify this point using an inline annotation:
LABEL_MODEL = '''<?xml version="1.0" encoding="utf-8"?>
<labels xmlns:eg="http://examplotron.org/0/">
<label eg:occurs="*">
<name>[Addressee name]</name>
<address>
<street>[Address street info]</street>
<city>[City]</city>
<state>[State abbreviation]</state>
</address>
</label>
</labels>
'''
Specifically, eg:occurs="*" indicates 0 or more occurrences.
The following is an XML document that conforms to the schema.
VALID_LABEL_XML = '''<?xml version="1.0" encoding="utf-8"?>
<labels>
<label>
<name>Thomas Eliot</name>
<address>
<street>3 Prufrock Lane</street>
<city>Stamford</city>
<state>CT</state>
</address>
</label>
</labels>
'''
The following is an XML document that does not conform to the schema.
INVALID_LABEL_XML = '''<?xml version="1.0" encoding="utf-8"?>
<labels>
<label>
<quote>What thou lovest well remains, the rest is dross</quote>
<name>Ezra Pound</name>
<address>
<street>45 Usura Place</street>
<city>Hailey</city>
<state>ID</state>
</address>
</label>
</labels>
'''
The quote element is not in the model.
One specifies the XML model to use when parsing to Bindery.
from amara import bindery
from amara.bindery.model import *
label_model = examplotron_model(LABEL_MODEL)
doc = bindery.parse(VALID_LABEL_XML, model=label_model)
doc.xml_validate()
doc = bindery.parse(INVALID_LABEL_XML, model=label_model)
try:
    doc.xml_validate()
except bindery.BinderyError, e:
    print e
doc.xml_write()
Parsing INVALID_LABEL_XML succeeds but the xml_validate() method fails and raises an exception because of the unexpected quote element. Note: it's no problem to validate just an element's subtree rather than the entire document. This validation is also available after mutation with the Amara API. Validation can be a bit expensive (though not noticeably unless you're dealing with huge docs), so it should be used judiciously. The penalty is only paid upon actual validation. Mutation, document access and other operations proceed at regular speed.
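To make the "example document as schema" idea concrete, here is a much-simplified sketch in Python 3 with the standard xml.etree.ElementTree. It only records which child names the example shows under each element name, and ignores occurrence annotations, ordering, attributes and text, so it is far weaker than Examplotron. The functions allowed_children and unexpected_elements are hypothetical and are not part of Amara.

```python
import xml.etree.ElementTree as ET

def allowed_children(example_root):
    """Map each element name in the example to the set of child names seen."""
    model = {}
    for elem in example_root.iter():
        model.setdefault(elem.tag, set()).update(child.tag for child in elem)
    return model

def unexpected_elements(instance_root, model):
    """Return (parent, child) name pairs that the example never showed."""
    problems = []
    for elem in instance_root.iter():
        permitted = model.get(elem.tag, set())
        for child in elem:
            if child.tag not in permitted:
                problems.append((elem.tag, child.tag))
    return problems

EXAMPLE = ('<labels><label><name>[Addressee name]</name>'
           '<address><city>[City]</city></address></label></labels>')
GOOD = ('<labels><label><name>Thomas Eliot</name>'
        '<address><city>Stamford</city></address></label></labels>')
BAD = ('<labels><label><quote>dross</quote><name>Ezra Pound</name>'
       '<address><city>Hailey</city></address></label></labels>')

model = allowed_children(ET.fromstring(EXAMPLE))
good_problems = unexpected_elements(ET.fromstring(GOOD), model)
bad_problems = unexpected_elements(ET.fromstring(BAD), model)
print(good_problems, bad_problems)
```

As in the Amara example, parsing the nonconforming document succeeds, and the problem only surfaces when the check is explicitly run.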
With a somewhat irregular XML document, it can be tricky to use bindery object traversal (e.g. doc.labels.label) without risking AttributeError. A model used in parsing a document makes the binding smarter, setting a default value to be returned in cases where a known element happens to be missing somewhere in the instance document.
LABEL_MODEL = '''<?xml version="1.0" encoding="utf-8"?>
<labels>
<label>
<quote>What thou lovest well remains, the rest is dross</quote>
<name>Ezra Pound</name>
<address>
<street>45 Usura Place</street>
<city>Hailey</city>
<state>ID</state>
</address>
</label>
</labels>
'''
TEST_LABEL_XML = '''<?xml version="1.0" encoding="utf-8"?>
<labels>
<label>
<name>Thomas Eliot</name>
<address>
<street>3 Prufrock Lane</street>
<city>Stamford</city>
<state>CT</state>
</address>
</label>
</labels>
'''
from amara import bindery
from amara.bindery.model import *
label_model = examplotron_model(LABEL_MODEL)
doc = bindery.parse(TEST_LABEL_XML, model=label_model)
print doc.labels.label.quote #None, rather than raising AttributeError
So even though the instance document doesn't have a quote element, Amara knows from the model that this is an optional element. If you try to access the quote element you get back the default value of None, which can of course be overridden.
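The general pattern of returning a default for names a model knows about can be sketched with Python's __getattr__ hook. This is an illustration in Python 3 with the standard library only. The class Bound and the hand-written KNOWN set are hypothetical stand-ins, and nothing here reflects how Amara's bindery is actually implemented.

```python
import xml.etree.ElementTree as ET

class Bound(object):
    """Hypothetical wrapper, not Amara's bindery: child elements are exposed
    as attributes, and names known from the model but missing from the
    instance return a default instead of raising AttributeError."""

    def __init__(self, elem, known_names, default=None):
        self.elem = elem
        self.known_names = known_names
        self.default = default

    def __getattr__(self, name):
        # Only called when normal attribute lookup fails
        child = self.elem.find(name)
        if child is not None:
            return Bound(child, self.known_names, self.default)
        if name in self.known_names:
            return self.default
        raise AttributeError(name)

# Names that a model (written by hand here) says may appear
KNOWN = {'label', 'quote', 'name', 'address'}
labels = Bound(ET.fromstring(
    '<labels><label><name>Thomas Eliot</name></label></labels>'), KNOWN)
print(labels.label.name.elem.text)
print(labels.label.quote)
```

Names outside the known set still raise AttributeError, so genuine typos in traversal code are not silently masked.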
Extracting metadata from models
If the model uses inline declaration of particularly interesting parts of the document, Amara provides a mechanism to extract those interesting bits as an iterator of simple tuples, so one can in effect skip the XML "API" altogether. In the following example the metadata extraction annotations are in the namespace given the ak prefix.
from amara.xpath import datatypes
from amara.bindery.model import examplotron_model, generate_metadata
from amara import bindery
from amara.lib import U
MODEL_A = '''<labels
xmlns:eg="http://examplotron.org/0/"
xmlns:ak="http://purl.org/xml3k/akara/xmlmodel">
<label id="tse" added="2003-06-10" eg:occurs="*" ak:resource="@id">
<!-- use ak:resource="" for an anonymous resource -->
<quote eg:occurs="?">
<emph>Midwinter</emph> Spring is its own <strong>season</strong>...
</quote>
<name ak:rel="name()">Thomas Eliot</name>
<address ak:rel="'place'" ak:value="concat(city, ',', province)">
<street>3 Prufrock Lane</street>
<city>Stamford</city>
<province>CT</province>
</address>
<opus year="1932" ak:rel="name()" ak:resource="">
<title ak:rel="name()">The Wasteland</title>
</opus>
<tag eg:occurs="*" ak:rel="name()">old possum</tag>
</label>
</labels>
'''
labelmodel = examplotron_model(MODEL_A)
INSTANCE_A_1 = '''<labels>
<label id="co" added="2004-11-15">
<name>Christopher Okigbo</name>
<address>
<street>7 Heaven's Gate</street>
<city>Idoto</city>
<province>Anambra</province>
</address>
<opus>
<title>Heaven's Gate</title>
</opus>
<tag>biafra</tag>
<tag>poet</tag>
</label>
</labels>
'''
doc = bindery.parse(INSTANCE_A_1, model=labelmodel)
for triple in generate_metadata(doc): #Triples, but only RDF if you want it to be
    print (triple[0], triple[1], U(triple[2]))
The output is:
(u'co', u'name', u'Christopher Okigbo')
(u'co', u'place', u'Idoto,Anambra')
(u'co', u'opus', u'r2e0e1e5')
(u'r2e0e1e5', u'title', u"Heaven's Gate")
(u'co', u'tag', u'biafra')
(u'co', u'tag', u'poet')
Each triple is (current-resource-id, relationship-string, result-xpath-expression). Notice the U convenience function, which takes an object and figures out a way to get you back a Unicode object.
Python's iterator goodness makes it easy to organize this data in any convenient way, for example:
from itertools import groupby
from operator import itemgetter
from amara.lib import U
for rid, triples in groupby(generate_metadata(doc), itemgetter(0)):
    print 'Resource:', rid
    for row in triples:
        print '\t', (row[0], row[1], U(row[2]))
The output is:
Resource: co
    (u'co', u'name', u'Christopher Okigbo')
    (u'co', u'place', u'Idoto,Anambra')
    (u'co', u'opus', u'r2e0e1e5')
Resource: r2e0e1e5
    (u'r2e0e1e5', u'title', u"Heaven's Gate")
Resource: co
    (u'co', u'tag', u'biafra')
    (u'co', u'tag', u'poet')
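The resource co appears twice in that output because itertools.groupby only merges adjacent items with equal keys, and the opus title triple interrupts the run. If one group per resource is wanted, a dictionary of lists is a simple alternative. The sketch below is plain Python 3 and uses the triples printed above as literal data rather than calling generate_metadata.

```python
from collections import defaultdict

# Literal copy of the triples shown in the output above
triples = [
    ('co', 'name', 'Christopher Okigbo'),
    ('co', 'place', 'Idoto,Anambra'),
    ('co', 'opus', 'r2e0e1e5'),
    ('r2e0e1e5', 'title', "Heaven's Gate"),
    ('co', 'tag', 'biafra'),
    ('co', 'tag', 'poet'),
]

# Collect every (relationship, value) pair under its resource ID,
# regardless of where the triple appears in the stream
by_resource = defaultdict(list)
for rid, rel, value in triples:
    by_resource[rid].append((rel, value))

for rid, properties in by_resource.items():
    print('Resource:', rid)
    for prop in properties:
        print('\t', prop)
```

The trade-off is that this holds all triples in memory, whereas groupby works lazily on the stream.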
Incremental parsing
Imagine a 10MB XML file with a very long sequence of small records. If you try to use a convenient tree API you will end up loading into memory a structure several times the size of the full XML document, but very often when processing such files, all that matters is one record at a time. You could switch to SAX, but then you lose the convenience of the tree API.
Amara provides a system for incremental parsing, which yields subtrees according to a declared pattern. It is provided as the pushtree function in amara.pushtree, which takes the full XML source, a pattern for the subtrees, and a callback function that is sent the subtrees as they are ready, as in the following example. The patterns are a subset of XPath.
from amara.pushtree import pushtree
from amara.lib import U
def receive_nodes(node):
    print U(node) #Will print 0 then 1 then 10 then 11 with input below
    return
XML="""<doc>
<one><a>0</a><a>1</a></one>
<two><a>10</a><a>11</a></two>
</doc>
"""
pushtree(XML, u'a', receive_nodes)
Which should put out:
0
1
10
11
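The same record-at-a-time idea can be tried with the standard library, which may be a useful reference point. This is a sketch in Python 3 using xml.etree.ElementTree.iterparse. It matches on a bare element name rather than an XPath-like pattern, and the function iter_records is hypothetical, not an Amara API.

```python
import io
import xml.etree.ElementTree as ET

XML = b"""<doc>
  <one><a>0</a><a>1</a></one>
  <two><a>10</a><a>11</a></two>
</doc>
"""

def iter_records(source, tag):
    """Yield the text of each matching element as soon as it is complete,
    then clear the element so its content can be released."""
    for event, elem in ET.iterparse(source, events=('end',)):
        if elem.tag == tag:
            yield elem.text
            elem.clear()

values = list(iter_records(io.BytesIO(XML), 'a'))
print(values)
```

Unlike the callback style above, this version pulls records through a generator, but the memory argument is the same: only one record needs to be fully materialized at a time.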
You can also specialize the nodes sent to the callback. The most common use for this feature is to deal with more friendly Bindery nodes rather than raw tree nodes.
from amara.pushtree import pushtree
from amara.lib import U
from amara.bindery.nodes import entity_base
def receive_nodes(node):
    print U(node.b) #Will print 0 then 1 then 10 then 11 with input below
    return
XML="""<doc>
<one><a b='0'/><a b='1'/></one>
<two><a b='10'/><a b='11'/></two>
</doc>
"""
pushtree(XML, u'a', receive_nodes, entity_factory=entity_base)
Which should put out the same as the earlier example.
You can use a coroutine if you need easier state management in the push target.
from amara.pushtree import pushtree
from amara.lib.util import coroutine
@coroutine
def receive_nodes(text_list):
    while True:
        node = yield
        text_list.append(node.xml_encode())
    return
XML="""<doc>
<one><a>0</a><a>1</a></one>
<two><a>10</a><a>11</a></two>
</doc>
"""
text_list = []
target = receive_nodes(text_list)
pushtree(XML, u'a', target.send)
target.close()
print text_list
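A generator must be advanced to its first yield before send() can be used, and a priming decorator hides that step. The sketch below, in Python 3, shows one common way to write such a decorator. It is an assumption about what a helper like amara.lib.util.coroutine does, not a copy of Amara's code, and the names coroutine_sketch and collect are hypothetical.

```python
import functools

def coroutine_sketch(func):
    """Hypothetical priming decorator: create the generator and advance it
    to its first yield so that callers can use send() right away."""
    @functools.wraps(func)
    def start(*args, **kwargs):
        gen = func(*args, **kwargs)
        next(gen)
        return gen
    return start

@coroutine_sketch
def collect(items):
    # State (the items list) lives with the coroutine, not in globals
    while True:
        item = yield
        items.append(item)

collected = []
target = collect(collected)
for value in ('0', '1', '10', '11'):
    target.send(value)
target.close()
print(collected)
```

Passing target.send as a callback, as in the pushtree example, works because a bound send method is just a one-argument callable.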
And much more
Amara provides many facilities beyond those covered above, such as XSLT.
Akara Web Framework
A system for writing REST-friendly data services.
- Functions can be written using the core library features, perhaps as unit transforms
- Apply simple wrappers to turn functions into RESTful services
- Akara runs as a repository of services, and allows you to discover these using a simple GET
- Service classes have IDs, independent from locations of individual service end-points
- Built-in facilities for Web triggers, AKA “Web hooks” (like DBMS triggers: declaration that one event actuates another, in this case HTTP requests)
- Modern multi-process Web server dispatch engine for services
A simple, complete Akara module
For the basic set-up of an Akara module, one can start with echo.py and then customize accordingly. The following is a complete module for which you can indicate the URL of an XML document and get, in the HTTP response, a count of the elements therein. The simplest case involves a Python function that takes a few parameters and returns a result, with the whole thing wrapped as a Web service. For this case you can use the @simple_service decorator.
import amara
from akara.services import simple_service, response
ECOUNTER_SERVICE_ID = 'http://purl.org/akara/services/demo/element_counter'
@simple_service('GET', ECOUNTER_SERVICE_ID, 'ecounter', 'text/plain')
def ecounter(uri):
    'Respond with the count of the number of elements in the specified XML document'
    #e.g.: curl "http://localhost:8880/ecounter?uri=http://hg.akara.info/testdoc.xml"
    doc = amara.parse(uri[0])
    ecount = doc.xml_select(u'count(//*)')
    return str(ecount)
All Akara services have an ID, a URI (ECOUNTER_SERVICE_ID in the above), which represents the essence of that service, i.e. its inputs, outputs and behavior. You and I might take the same Akara code, and you host it on your server and I host it on mine. The service ID will be the same in both cases, but the access endpoint, i.e. what URL users invoke to use the services, will be different.
Use the @simple_service decorator to indicate that a function is a service, and specify what HTTP methods it handles, the ID for the service, and the default mount point, which is the trailing bit of the access endpoint URL. If you mount this service on an Akara instance running at http://localhost:8880, then its access endpoint will be http://localhost:8880/ecounter. The user can send an HTTP GET request to this URL, and the decorated function will be invoked.
ecounter is the decorated function. Simple service implementation functions wrapped as HTTP POST methods, for example a function with the signature akara_echo_body(body, ctype), receive the HTTP POST body and the HTTP Content Type header as parameters. The latter is a convenience. All the other HTTP headers are also available using WSGI (more on this later).
The following version demonstrates some basic security features:
import amara
from amara.lib import irihelpers, inputsource
from akara.services import simple_service, response
ECOUNTER_SERVICE_ID = 'http://purl.org/akara/services/demo/element_counter'
#Config info is pulled in at global scope as AKARA_MODULE_CONFIG
#Security demo: create a URI jail outside of which XML operations won't leak
URI_JAIL = AKARA_MODULE_CONFIG.get('uri_jail')
#Create the assertion rule for the URI jail
ALLOWED = [(lambda uri: uri.startswith(URI_JAIL), True)]
#Create a URI resolver instance that enforces the jail
restricted_resolver = irihelpers.resolver(authorizations=ALLOWED)
@simple_service('GET', ECOUNTER_SERVICE_ID, 'ecounter', 'text/plain')
def ecounter(uri):
    #e.g.: curl "http://localhost:8880/ecounter?uri=http://hg.akara.info/testdoc.xml"
    uri = inputsource(uri[0], resolver=restricted_resolver)
    doc = amara.parse(uri)
    ecount = doc.xml_select(u'count(//*)')
    return str(ecount)
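The jail rule in the listing above boils down to a prefix test on the URI. The following sketch isolates that idea in plain Python 3, without Amara; make_jail_check is a hypothetical name, and the evil.example host is invented to show a pitfall.

```python
def make_jail_check(jail_prefix):
    """Return a predicate that accepts only URIs starting with jail_prefix."""
    def allowed(uri):
        return uri.startswith(jail_prefix)
    return allowed

# A jail that ends with a slash confines requests to one host
strict = make_jail_check('http://hg.akara.info/')
print(strict('http://hg.akara.info/testdoc.xml'))  # True
print(strict('file:///etc/passwd'))                # False

# Without the trailing slash, a look-alike host slips through
loose = make_jail_check('http://hg.akara.info')
print(loose('http://hg.akara.info.evil.example/x.xml'))  # True
```

Because the check is a plain string prefix, it is worth ending the configured jail URI with a slash (or a full path) so that it cannot match a longer host name.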
Hello World
The following Akara module implements a simple Hello world service.
from akara.services import simple_service
HELLO_SERVICE_ID = 'http://example.org/my-services/hello'
@simple_service('GET', HELLO_SERVICE_ID, 'hello')
def helloworld(friend=None):
    return u'Hello, ' + friend.decode('utf-8') #Returns a Unicode object
Save this as hello.py and make it available in `PYTHONPATH`, and update the akara.conf of an Akara instance to load the module. If the instance is at localhost:8880, you can invoke the new module as follows:
$ curl http://localhost:8880/hello?friend=Uche
Hello, Uche
Or, if you prefer, put http://localhost:8880/hello?friend=Uche into your browser to get the nice greeting. Go ahead and play around with URL basics, e.g.:
$ curl http://localhost:8880/hello?friend=Uche+Ogbuji
Hello, Uche Ogbuji
This behaves just like http://localhost:8880/hello?friend=Uche%20Ogbuji, because + and %20 both encode a space in a query string.
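To see why the two URLs are equivalent, the following sketch decodes both query strings with the Python 3 standard library. It uses urllib.parse rather than Akara's own request handling, which is assumed here to decode query strings in the same way.

```python
from urllib.parse import parse_qs

# Both spellings of the space decode to the same parameter value
plus_form = parse_qs('friend=Uche+Ogbuji')
percent_form = parse_qs('friend=Uche%20Ogbuji')

print(plus_form)     # {'friend': ['Uche Ogbuji']}
print(percent_form)  # {'friend': ['Uche Ogbuji']}
```

Note that each value is a list, since a query parameter may be repeated.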
Introducing WSGI, and working with URL path hierarchy
The above approach works fine if you are creating very simple, dynamic query services, but it is tempting to overuse it, and so to squander much of the benefit of REST.
In many Web applications, rather than calculating a greeting on the fly, we're instead gathering information and even modifying some well-known, referenceable resource. In such cases, the common convention is to use hierarchical URLs to represent the different resources. As an example, say we're developing a database of poets and their works. Each poet would be a distinct resource, e.g. at http://localhost:8880/poetdb/poet/ep.
To get this somewhat more sophisticated behavior, we take advantage of the common WSGI convention of Python. The following complete Akara module implements the poet database.
from wsgiref.util import shift_path_info
from akara.services import simple_service
from akara import request
POETDB_SERVICE_ID = 'http://example.org/my-services/poetdb'
#Cheap DBMS
POETDB = {
    u'poet':
    {
        u'ep': (u'Ezra Pound', u'45 Usura Place, Hailey, ID'),
        u'co': (u'Christopher Okigbo', u'7 Heaven\'s Gate, Idoto, Anambra, Nigeria')
    },
    u'work':
    {
        u'cantos': (u'The Cantos', u'../poet/ep'),
        u'mauberley': (u'Hugh Selwyn Mauberley', u'../poet/ep'),
        u'thunderpaths': (u'Paths of Thunder', u'../poet/co')
    },
}
@simple_service('GET', POETDB_SERVICE_ID, 'poetdb', 'text/html')
def poetdb():
    entitytype = shift_path_info(request.environ)
    eid = shift_path_info(request.environ)
    info = POETDB[entitytype][eid]
    if entitytype == u'poet':
        #name, address = info
        return u'<p>Poet: %s, of %s</p>'%info
    elif entitytype == u'work':
        #name, poet = info
        return u'<p>Work: %s, <a href="%s">click for poet info</a></p>'%info
Focusing in on some key lines:
from akara import request
...
entitytype = shift_path_info(request.environ)
The request object, which becomes available to your module through the import, is the main way to access information from the HTTP request, using WSGI conventions such as the environ mapping. The Python stdlib function wsgiref.util.shift_path_info extracts one hierarchical path component from the URL used to access the service.
So going back to the sample URL for a poet, http://localhost:8880/poetdb/poet/ep, Akara itself is mounted at http://localhost:8880/ and the service defined above is mounted at http://localhost:8880/poetdb/. The first call to shift_path_info extracts the poet component, and the second extracts the ep component.
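The effect of the two calls can be tried outside Akara with a hand-built WSGI environ. This is a Python 3 sketch; the environ below is deliberately partial and only stands in for what a server would supply for the sample URL.

```python
from wsgiref.util import shift_path_info

# Illustrative environ for /poetdb/poet/ep, with the service mounted at /poetdb
environ = {'SCRIPT_NAME': '/poetdb', 'PATH_INFO': '/poet/ep'}

entitytype = shift_path_info(environ)  # 'poet'
eid = shift_path_info(environ)         # 'ep'
leftover = shift_path_info(environ)    # None, nothing left to consume

print(entitytype, eid, leftover)
print(environ['SCRIPT_NAME'])          # /poetdb/poet/ep
```

Each call moves one component from PATH_INFO to SCRIPT_NAME, so code that needs the service's own base path should read SCRIPT_NAME before shifting.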
@simple_service('GET', POETDB_SERVICE_ID, 'poetdb', 'text/html')
Notice the additional argument, which declares the return content type. The output of this service is HTML.
return u'<p>Poet: %s, of %s</p>'%info
The return value is a Unicode object. An Akara service handler can return a byte string, a Unicode object, or even a parsed Amara XML object.
Deploy this module and restart Akara. Now if you go to e.g. http://localhost:8880/poetdb/work/cantos in a browser, you will get a page saying "Work: The Cantos, click for poet info," and clicking the link takes you to a page with the representation of the poet resource http://localhost:8880/poetdb/poet/ep, based on the relative link set up in the POETDB data structure.
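The browser resolves the relative link against the URL of the work page. The same resolution can be reproduced with the Python 3 standard library, as a quick check of where the link leads:

```python
from urllib.parse import urljoin

work_url = 'http://localhost:8880/poetdb/work/cantos'
poet_ref = '../poet/ep'  # as stored in the POETDB entry for The Cantos

poet_url = urljoin(work_url, poet_ref)
print(poet_url)  # http://localhost:8880/poetdb/poet/ep
```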
Now you're really getting into the Web application space, and rubbing up a bit against REST in that resources such as poet and work are clearly identified by URL, and clearly referenced within the content via hypermedia (i.e. good old Web links).
Error handling, and making things more robust
Try out the following URL on the above service:
http://localhost:8880/poetdb/poet/noep
You get the dreaded 500 error. The Web is a wild place, and you never know what input or conditions you're going to be dealing with, so anticipating and gracefully handling errors is important. Let's set it up so that the server returns a 404 "Not Found" error in case the URL path doesn't match anything in the database. Let's also set up some basic link index pages to help the user. In general the following is a much more complete and functional example.
from wsgiref.util import shift_path_info, request_uri
from amara.lib.iri import join
from akara.services import simple_service
from akara import request, response
POETDB_SERVICE_ID = 'http://example.org/my-services/poetdb'
#Cheap DBMS
POETDB = {
    u'poet':
    {
        u'ep': (u'Ezra Pound', u'45 Usura Place, Hailey, ID'),
        u'co': (u'Christopher Okigbo', u'7 Heaven\'s Gate, Idoto, Anambra, Nigeria')
    },
    u'work':
    {
        u'cantos': (u'The Cantos', u'../poet/ep'),
        u'mauberley': (u'Hugh Selwyn Mauberley', u'../poet/ep'),
        u'thunder': (u'Paths of Thunder', u'../poet/co')
    },
}
def not_found(baseuri):
    ruri = request_uri(request.environ)
    response.code = "404 Not Found"
    return u'<p>Unable to find: %s, try the <a href="%s">index of works</a></p>'%(ruri, baseuri)
@simple_service('GET', POETDB_SERVICE_ID, 'poetdb', 'text/html')
def poetdb():
    baseuri = request.environ['SCRIPT_NAME'] + '/'
    def get_work(wid):
        uri = join(baseuri, 'work', wid)
        name, poet = POETDB[u'work'][wid]
        puri = join(baseuri, 'poet', poet)
        return '<p>Poetic work: <a href="%s">%s</a>, by <a href="%s">linked poet</a></p>'%(uri, name, puri)
    def get_poet(pid):
        uri = join(baseuri, 'poet', pid)
        name, address = POETDB[u'poet'][pid]
        return '<p>Poet: <a href="%s">%s</a></p>'%(uri, name)
    getters = { u'work': get_work, u'poet': get_poet }
    entitytype = shift_path_info(request.environ)
    if not entitytype:
        entitytype = u'work'
    if entitytype not in POETDB:
        return not_found(baseuri)
    eid = shift_path_info(request.environ)
    if not eid:
        #Return an index of works or poets
        entries = []
        for entity_id in POETDB[entitytype]:
            entries.append(getters[entitytype](entity_id))
        return '\n'.join(entries)
    try:
        return getters[entitytype](eid)
    except KeyError:
        return not_found(baseuri)
Again, focusing on the key new bits:
from amara.lib.iri import join
Amara comes with many URI, and more generally IRI (internationalized URI), functions that are more RFC-compliant than the urllib equivalents, including the join function, which constructs URI references from hierarchical path components.
from akara import request, response
The response object allows you to manage the HTTP response status, headers, and such.
def not_found(baseuri):
    ruri = request_uri(request.environ)
    response.code = "404 Not Found"
    return u'<p>Unable to find: %s, try the <a href="%s">index of works</a></p>'%(ruri, baseuri)
Just a little utility function to provide a 404 response, with some information useful to the end user. request_uri is a Python stdlib function to reconstruct the request URI from a WSGI environment.
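The same 404 pattern can be tried without Akara as a plain WSGI application. The sketch below is written for Python 3 and uses only the standard library; POETS, poet_app and call are hypothetical names for illustration, and the app sets the status through start_response rather than through Akara's response object.

```python
from html import escape
from wsgiref.util import setup_testing_defaults, shift_path_info

POETS = {'ep': 'Ezra Pound', 'co': 'Christopher Okigbo'}

def poet_app(environ, start_response):
    """Plain WSGI app: 200 with the poet's name, or 404 for an unknown ID."""
    pid = shift_path_info(environ)
    try:
        body = '<p>Poet: %s</p>' % POETS[pid]
        status = '200 OK'
    except KeyError:
        # Escape the ID, since it comes straight from the request URL
        body = '<p>Unable to find: %s</p>' % escape(str(pid))
        status = '404 Not Found'
    payload = body.encode('utf-8')
    start_response(status, [('Content-Type', 'text/html; charset=utf-8'),
                            ('Content-Length', str(len(payload)))])
    return [payload]

def call(app, path):
    """Invoke a WSGI app directly, without a server; return (status, body)."""
    environ = {'PATH_INFO': path}
    setup_testing_defaults(environ)
    seen = {}
    def start_response(status, headers):
        seen['status'] = status
    body = b''.join(app(environ, start_response))
    return seen['status'], body.decode('utf-8')

print(call(poet_app, '/ep'))    # ('200 OK', '<p>Poet: Ezra Pound</p>')
print(call(poet_app, '/noep'))  # ('404 Not Found', '<p>Unable to find: noep</p>')
```

Calling the app directly like this is also a convenient way to test error paths without starting a server.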
baseuri = request.environ['SCRIPT_NAME'] + '/'
Here you construct the base path of this service, from which the links in the generated HTML are built.
def get_work(wid):
    uri = join(baseuri, 'work', wid)
    name, poet = POETDB[u'work'][wid]
    puri = join(baseuri, 'poet', poet)
    return '<p>Poetic work: <a href="%s">%s</a>, by <a href="%s">linked poet</a></p>'%(uri, name, puri)
A routine to generate HTML of the information for a single work. Notice how amara.lib.iri.join is used to construct links.
getters = { u'work': get_work, u'poet': get_poet }
Just a way to package up the reusable routines for generating poet and work info.
#Return an index of works or poets
entries = []
for entity_id in POETDB[entitytype]:
    entries.append(getters[entitytype](entity_id))
return '\n'.join(entries)
Go through the index of works or poets, depending on the requested entity type, and return an aggregate HTML document built from the fragments.
Handling HTTP POST
The above example handles HTTP GET, and of course POST is a big part of the Web. It's best known for Web forms, though Akara is not specialized for such usage in the way more mainstream Web frameworks (CherryPy, Django, etc.) are. You can use Akara to handle Web forms, but more often Akara users will be dealing with data services, often using requests POSTed directly to the endpoint. This is a common pattern for open Web APIs such as those of social networks.
Since POST on the Web is generally used where the state of Web resources is changing, this is usually the area where you need to deal with some sort of persistence in your application. You'll see an example of that in this section, moving from the in-memory data structure of the previous section to something more serious. You'll also see an example of how to read configuration information.
The following listing is an Akara module for accepting reservations of business resources such as conference rooms and the like.
This example is designed to illustrate the mechanics of POST handling, but is not a good example of REST style, presented as it is for simplicity. Akara does support strong REST principles, including hypermedia and proper use of HTTP verbs.
import shelve
from amara import bindery
from amara.lib import U
import akara
from akara.services import simple_service
DBFNAME = akara.module_config()['dbfile']
NEWPOET_SERVICE_ID = 'http://example.org/my-services/new-poet'
@simple_service('POST', NEWPOET_SERVICE_ID, 'newpoet', 'text/plain')
def newpoet(body, ctype):
    '''
    Add a poet to the database.
    Sample POST body:
    <newpoet id="co">
      <name>Christopher Okigbo</name><address>Christopher Okigbo</address>
    </newpoet>
    '''
    dbfile = shelve.open(DBFNAME)
    #Warning: no validation of incoming XML
    doc = bindery.parse(body)
    dbfile[U(doc.newpoet.id)] = (U(doc.newpoet.name), U(doc.newpoet.address))
    dbfile.close()
    return 'Poet added OK'
This module requires a configuration variable, dbfile, which you can provide by adding the following (or similar) to akara.conf:
class tutorial_post:
    dbfile = '/tmp/poet'
Once the service is running, you can use something like the following command line to add a poet:
curl --request POST --data-binary "@-" "http://localhost:8880/newpoet" << END
<newpoet id="co">
<name>Christopher Okigbo</name><address>Christopher Okigbo</address>
</newpoet>
END
You can verify the result easily enough by querying the low level database file.
>>> import shelve
>>> d=shelve.open('/tmp/poet')
>>> print d.keys()
['co']
>>> print d['co']
(u'Christopher Okigbo', u'Christopher Okigbo')
This tutorial uses shelve for simplicity, but for real-world applications you almost certainly want another persistence back end, such as SQLite. Also, these examples are not safe for concurrent access from multiple module instances, which is just about guaranteed in a real-world application.
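As a rough sketch of what a SQLite back end could look like, the following Python 3 code stores the posted poet with the standard library's sqlite3 module. It is not an Akara module: add_poet, the table layout and the in-memory database are assumptions made for illustration, and xml.etree.ElementTree stands in for Amara's bindery.

```python
import sqlite3
import xml.etree.ElementTree as ET

def add_poet(conn, body):
    """Parse a <newpoet> document and store it; returns the poet ID."""
    root = ET.fromstring(body)  # still no validation of incoming XML
    pid = root.get('id')
    name = root.findtext('name')
    address = root.findtext('address')
    with conn:  # commits on success, rolls back if the insert fails
        conn.execute(
            'INSERT OR REPLACE INTO poet (id, name, address) VALUES (?, ?, ?)',
            (pid, name, address))
    return pid

# An in-memory database keeps the sketch self-contained
conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE poet (id TEXT PRIMARY KEY, name TEXT, address TEXT)')

BODY = '''<newpoet id="co">
  <name>Christopher Okigbo</name>
  <address>7 Heaven's Gate, Idoto, Anambra, Nigeria</address>
</newpoet>'''

add_poet(conn, BODY)
row = conn.execute('SELECT name, address FROM poet WHERE id = ?', ('co',)).fetchone()
print(row)
```

Wrapping the write in a transaction is the main gain over shelve here; a deployed service would open a file-backed database and would still need to consider how many processes write to it at once.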
Conclusion
Akara's design makes it easy to integrate with other persistence facilities, from relational to state-of-the-art DBMS, and certainly modern cloud-style storage services. It has seen a wide variety of use with mixed and matched components, whether incorporating Web-based transform and validation services or attaching modern visualization systems. One way to fulfill sophisticated XML-driven database requirements is to use monolithic software, but another important approach is to stitch together loosely coupled components from remote and local software. Akara offers a solid backbone for assembly of such heterogeneous systems.
Appendix A: More background on 4Suite
To better understand the spirit behind Akara, it is useful to have some historical perspective on its predecessor, 4Suite, which enjoyed very active development for the decade starting in 1998. 4Suite also spawned additional work and influence in numerous other areas, for example serving as the core XML processing toolkit for the Red Hat and Fedora Core distributions in the mid-2000s, contributing components to the Python language, and serving as a reference implementation during the development of RFC 3986, thus influencing several other packages.
4Suite and Akara have over the years provided several important innovations, including:
- XML/RDF triggered transforms (helped inspire GRDDL)
- Path-based RDF query mounted across an XML/RDF repository (Versa, which inspired many others, and was an input to W3C's SPARQL work)
- Rules-based (rather than type-systems-based) data binding for XML and RDF
- RDF query within XSLT
- Push-style data-driven multiple dispatch to code
- Pioneering implementations of DOM, XPath, XSLT, XLink, XPointer, RELAX NG, Schematron and more
import amara
from amara import tree
MONTY_XML = """<monty>
<python spam="eggs">What do you mean "bleh"</python>
<python ministry="abuse">But I was looking for argument</python>
</monty>"""
doc = amara.parse(MONTY_XML)
assert doc.xml_type == tree.entity.xml_type
m = doc.xml_children[0] #xml_children is a sequence of child nodes
assert m.xml_local == u'monty' #local name, i.e. without any prefix
assert m.xml_qname == u'monty' #qualified name, e.g. includes prefix
assert m.xml_prefix == None
assert m.xml_namespace == None
assert m.xml_name == (None, u'monty') #"universal" or "expanded" name
assert m.xml_parent == doc
p1 = m.xml_children[1]
p1.xml_attributes[(None, u'spam')] = u"greeneggs"
p1.xml_children[0].xml_value = u"Close to the edit"
from amara import bindery
MONTY_XML = """<quotes>
<quote skit="1">This parrot is dead</quote>
<quote skit="2">What do you mean "bleh"</quote>
<quote skit="2">I don't like spam</quote>
<quote skit="3">But I was looking for argument</quote>
</quotes>"""
doc = bindery.parse(MONTY_XML)
q1 = doc.quotes.quote # or doc.quotes.quote[0]
print q1.skit
print q1.xml_attributes[(None, u'skit')] # XPath works too: q1.xml_select(u'@skit')
for q in doc.quotes.quote: # The loop will pick up all four quote elements
    print unicode(q) # Just the child char data
from itertools import groupby #Python stdlib
from operator import attrgetter #Python stdlib
skit_key = attrgetter('skit')
for skit, quotegroup in groupby(doc.quotes.quote, skit_key):
    print skit, [ unicode(q) for q in quotegroup ]
from amara import bindery
MONTY_XML = """<quotes>
<quote skit="1">This parrot is dead</quote>
<quote skit="2">What do you mean "bleh"</quote>
<quote skit="2">I don't like spam</quote>
<quote skit="3">But I was looking for argument</quote>
</quotes>"""
doc = bindery.parse(MONTY_XML)
from itertools import groupby #Python stdlib
from operator import attrgetter #Python stdlib
skit_key = attrgetter('skit')
for skit, quotegroup in groupby(doc.quotes.quote, skit_key):
    print skit, [ unicode(q) for q in quotegroup ]
from amara import bindery
from amara.bindery.model import *
MONTY_XML = """<monty>
<python spam="eggs">What do you mean "bleh"</python>
<python ministry="abuse">But I was looking for argument</python>
</monty>"""
doc = bindery.parse(MONTY_XML)
#Add constraint that `python` elements must have `ministry` attribute
c = constraint(u'@ministry')
try:
    doc.monty.python.xml_model.add_constraint(c, validate=True)
except bindery.BinderyError, e:
    #Exception raised b/c the doc doesn't meet the constraint we added
    pass #ignore and move on
#Update the doc to meet the desired constraint
doc.monty.python.xml_attributes[None, u'ministry'] = u'argument'
doc.monty.python.xml_model.add_constraint(c, validate=True)
from amara import bindery
from amara.bindery.model import *
MONTY_XML = """<monty>
<python spam="eggs">What do you mean "bleh"</python>
<python ministry="abuse">But I was looking for argument</python>
</monty>"""
doc = bindery.parse(MONTY_XML)
#Add a constraint using a specialized model primitive that supports a default
c = attribute_constraint(None, u'ministry', u'nonesuch')
doc.monty.python.xml_model.add_constraint(c, validate=True)
LABEL_MODEL = '''<?xml version="1.0" encoding="utf-8"?>
<labels>
<label>
<name>[Addressee name]</name>
<address>
<street>[Address street info]</street>
<city>[City]</city>
<state>[State abbreviation]</state>
</address>
</label>
</labels>
'''
#Construct a set of constraints and other model info from the example
label_model = examplotron_model(LABEL_MODEL)
#Now use this to validate an instance document VALID_LABEL_XML
doc = bindery.parse(VALID_LABEL_XML, model=label_model)
doc.xml_validate()
INSTANCE_A_1 = '''<labels>
<label id="co" added="2004-11-15">
<name>Christopher Okigbo</name>
<address>
<street>7 Heaven's Gate</street>
<city>Idoto</city>
<province>Anambra</province>
</address>
<opus>
<title>Heaven's Gate</title>
</opus>
<tag>biafra</tag>
<tag>poet</tag>
</label>
</labels>
'''
from amara.bindery.model import generate_metadata
doc = bindery.parse(INSTANCE_A_1, model=label_model)
for triple in generate_metadata(doc): #Triples but not RDF ;)
    print triple
import amara
from akara.services import simple_service, response
ECOUNTER_SERVICE_ID = 'http://purl.org/akara/services/demo/element_counter'
@simple_service('GET', ECOUNTER_SERVICE_ID, 'ecounter', 'text/plain')
def ecounter(uri):
    doc = amara.parse(uri[0])
    ecount = doc.xml_select(u'count(//*)')
    return str(ecount)
from amara.pushtree import pushtree
from amara.lib import U #U() = "Unicode, dammit!"
def receive_nodes(node):
    print U(node) #Will print 0 then 1 then 10 then 11 with input below
    return
XML="""<doc>
<one>
<a>0</a><a>1</a>
</one>
<two>
<a>10</a><a>11</a>
</two>
</doc>
"""
pushtree(XML, u'a', receive_nodes)
