The 4Suite approach to XML processing, which was then inherited by Amara, was the path o least resistance: parse everything as quickly as you can and do what you can to minimize the memory of the loaded tree without ghastly hacks that turn XML into useless rotten bits. Once you have the tree, make it easy to access and manipulate. We pushed optimization along those lines as much and reasonable, and to get the further performance boost we need we have to turn this philosophy on its head. Amara 2.x focuses on enriching the opportunities for manipulating things in the streaming phase.

Part of what this entails is optimizing the 80% or so of XPath used in the real world to make it available in streaming mode.

Streamable XPath subset for Amara 2.x

The following XPath 1.0 subset is recognized in Amara 2.x as available for processing in streaming mode. It's up to the code hosting the XPath to take advantage of this capability (e.g. XUpdate will be able to use this to update documents in streaming mode if all the XPaths meet these constraints).

Constraints

The subset is very similar to XSLT Patterns and FIXPtr

Note: As the XPath spec says: "The location path //para[1] does not mean the same as the location path /descendant::para[1]. The latter selects the first descendant para element; the former selects all descendant para elements that are the first para children of their parents."

Nice to haves

Some patterns that we might try to support, if it doesn't hamper the implementation too badly:

Usage

Streamable XPath is used where possible for

Eventually we expect to use it for XSLT, i.e. certain XSLT constraints plus streamable XPath would be recognized as suitable for full streaming operation. But this is a very tricky area, so will probably come last

Typical algorithm for such use:

1) Parse 2) scan all expressions 2.1) If *all* expressions are streamable, operate in streaming mode 2.2) else full tree mode

We could in some cases possibly support a mixed mode--parse with supported exprs, then perform remaining ops on the full tree. For example in XUpdate this would involve splitting the instructions into 2 separate XUpdate docs.

Matching against an Amara pattern

The algorithm for matching a node against a pattern is similar to that of XSLT patterns.

Let's say the node you are testing is candidate-node. Candidate node matches the pattern if and only if there is a node among candidate-node's ancestors, reference-node such that if you execute the pattern as an XPath, with reference-node as the context, candidate-node is included among the results.

That's a bit of a handful, so here is some pseudocode for the concept:

matched = False
reference_node = candidate_node.xml_parent()
while reference_node:
    if candidate_node in reference_node.xml_select(pattern):
        matched = True
        break
    reference_node = reference_node.xml_parent()

Note: for another example, see: http://bitbucket.org/uche/amara/src/tip/lib/bindery/util.py , line 82

See also

Yeah streamable XPath subsets are a perennial hot topic. A sampler:

Amara2/Architecture/Streaming_XPath (last edited 2010-04-25 04:12:31 by UcheOgbuji)