The 4Suite approach to XML processing, which was then inherited by Amara, was the path o least resistance: parse everything as quickly as you can and do what you can to minimize the memory of the loaded tree without ghastly hacks that turn XML into useless rotten bits. Once you have the tree, make it easy to access and manipulate. We pushed optimization along those lines as much and reasonable, and to get the further performance boost we need we have to turn this philosophy on its head. Amara 2.x focuses on enriching the opportunities for manipulating things in the streaming phase.
Part of what this entails is optimizing the 80% or so of XPath used in the real world to make it available in streaming mode.
Streamable XPath subset for Amara 2.x
The following XPath 1.0 subset is recognized in Amara 2.x as available for processing in streaming mode. It's up to the code hosting the XPath to take advantage of this capability (e.g. XUpdate will be able to use this to update documents in streaming mode if all the XPaths meet these constraints).
Constraints
The subset is very similar to XSLT Patterns and FIXPtr
- All node tests
e.g. *, @*, test(), node(), processing-instruction(), comment()
- id()
e.g. id('spam')
- simple steps with only "downward" axes (self, child, descendant, descendant-or-self)
e.g. a/b, a//b (which of course expands to a/descendant-or-self::node()/child::b), a/b/c, a/descendant::a
- self really only makes sense used in predicates
Note: . is shorthand for self::node(), which is technically in the covered subset, but is rarely useful in streaming context, even in the allowed subset of predicates
- numerical predicates
e.g. a[1]
- a numerical or positional predicate is only allowed in a step using the child axis
e.g. these are forbidden: a/descendant::b[2] and a/descendant-or-self::b[2]
but this is OK: a//b[2] since it expands to a/descendant-or-self::node()/child::b[2] and hence the predicate is on a step using the child axis
- predicates that only use the attribute or self axis for value comparison
e.g. a[@attr = 'value']
- Boolean expressions in predicates
e.g. a[@id='1' or @id='2'], a[@id='1' and @spam='eggs'], a[@id='1' and not(@spam='eggs')]
- Union expressions at the top level of the XPath
e.g. a|b or a/b|a/c but not a/(b|c) or a[b|c]
Note: As the XPath spec says: "The location path //para[1] does not mean the same as the location path /descendant::para[1]. The latter selects the first descendant para element; the former selects all descendant para elements that are the first para children of their parents."
Nice to haves
Some patterns that we might try to support, if it doesn't hamper the implementation too badly:
- predicates with one level of downward look-ahead to children (whether existence or attribute)
e.g. a[b], a[b[@attr = 'value']]
- Union expressions within paths or predicates
e.g. a/(b|c) or a[b|c]
- Last function
e.g. a/b[last()]
Usage
Streamable XPath is used where possible for
- XInclude+Xpointer (only supports streamable XPath)
- XUpdate
Eventually we expect to use it for XSLT, i.e. certain XSLT constraints plus streamable XPath would be recognized as suitable for full streaming operation. But this is a very tricky area, so will probably come last
Typical algorithm for such use:
1) Parse 2) scan all expressions 2.1) If *all* expressions are streamable, operate in streaming mode 2.2) else full tree mode
We could in some cases possibly support a mixed mode--parse with supported exprs, then perform remaining ops on the full tree. For example in XUpdate this would involve splitting the instructions into 2 separate XUpdate docs.
Matching against an Amara pattern
The algorithm for matching a node against a pattern is similar to that of XSLT patterns.
Let's say the node you are testing is candidate-node. Candidate node matches the pattern if and only if there is a node among candidate-node's ancestors, reference-node such that if you execute the pattern as an XPath, with reference-node as the context, candidate-node is included among the results.
That's a bit of a handful, so here is some pseudocode for the concept:
matched = False
reference_node = candidate_node.xml_parent()
while reference_node:
if candidate_node in reference_node.xml_select(pattern):
matched = True
break
reference_node = reference_node.xml_parent()Note: for another example, see: http://bitbucket.org/uche/amara/src/tip/lib/bindery/util.py , line 82
See also
Yeah streamable XPath subsets are a perennial hot topic. A sampler:
http://www.oreillynet.com/onlamp/blog/2002/12/streaming_xpath.html
"Streaming XPath Processing with Forward and Backward Axes" (HTML from Google cache)
"XMLDog - Using SAX to Sniff XML". "XMLDog is a dog that is trained to sniff xml documents." It seems to run through the document fully to flag bits for the sniffing phase, rather than handing them off as encountered, so it's not entirely streamable int he sense we'd like for Amara, but it's a useful example.
Streaming Transformations for XML (STX) implements its own streamable subset
