parallel xslt processing of large xml documents - xml prague 2015

Parallel XSLT Processing of Large Documents

Jakub Maly, Barclays, XML Prague 2015

Reminder on streaming…

Can now process huge documents in bounded memory

A whole new area where XSLT is now applicable

With trade-offs: the stylesheet must follow streamability rules

limited XPath

XSLT 3.0 only, only in commercial products

Large documents take a long time to process: processing time is dominated by the time required to parse the input

Motivation

Simple input XML structure, 700MB in size

Simple XSLT

Takes 35s to process…

<ProteinEntry id="CCMQR">
  <header>
    <uid>CCMQR</uid>
    <accession>A00003</accession>
    <created_date>17-Mar-1987</created_date>
    <seq-rev_date>17-Mar-1987</seq-rev_date>
    <txt-rev_date>03-Mar-2000</txt-rev_date>
  </header>
  <protein>
    <name>cytochrome c</name>
  </protein>...
</ProteinEntry>

Why so long?

I/O is not a problem (SSDs are fast enough)

We are using streaming, so memory consumption is constant (bounded)

The processor runs at 100%, but only on one of the cores…

Space for optimization?

Multi-core machines are ubiquitous

XSLT processor should use all cores if possible

Parsing + processing in multiple threads and then merge the outputs

Results

Trade-offs

One processor thread can’t see data processed by other threads

The document has to consist of fairly independent “records” that can be processed separately

As in streaming, we can’t “go back”

and crutches like accumulators won’t work

And sometimes can’t even “go up” (out of the record)

Requirements #1 (input)

The document has a well-defined structure (schema)

A major part of the content is in a sequence of nodes of certain types (we will call these core types)

Core types and their ancestors are not recursive.

Contents of core types are reasonably independent.

We expect that processing each record takes a similar amount of time

Input can be read by multiple threads from random positions

Requirements #2 (stylesheet)

Streamable

Explicitly marked templates for core nodes

Paths in those templates are absolute and use only child axis and element names

alternatively: provide schema

Only the core node and its subtree can be accessed by XPath

match="/ProteinDatabase/ProteinEntry"

pxsl:core="yes"
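The path restriction above (absolute, child axis only, plain element names) can be checked mechanically. The following is a minimal sketch of such a check, not the code used by pXSLT; the function names `is_valid_core_pattern` and `core_ancestors` are hypothetical, and element names are simplified to ASCII.

```python
import re

# An element name, optionally with a namespace prefix (simplified to ASCII).
_NAME = r"[A-Za-z_][\w.-]*(?::[A-Za-z_][\w.-]*)?"
# An absolute path built only from child steps, e.g. /a/b/c
_CORE_PATTERN = re.compile(rf"(?:/{_NAME})+")


def is_valid_core_pattern(pattern: str) -> bool:
    """True if `pattern` is an absolute path of plain element names.

    Rejects predicates, wildcards, attributes, '//' and explicit axes.
    """
    return _CORE_PATTERN.fullmatch(pattern.strip()) is not None


def core_ancestors(pattern: str) -> list:
    """Element names from the document element down to the core element."""
    if not is_valid_core_pattern(pattern):
        raise ValueError(f"not a valid core pattern: {pattern!r}")
    return pattern.strip().strip("/").split("/")


if __name__ == "__main__":
    for p in ["/ProteinDatabase/ProteinEntry", "//ProteinEntry", "/a/b[1]"]:
        print(p, is_valid_core_pattern(p))
```

Knowing the full chain of ancestor names is what makes it possible to re-create the enclosing elements around a portion of the input later on.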

Special cases

If we know more about the structure, we can access more data safely

E.g. if all core nodes are children of one node, we can read from „intro“ in all threads

Special cases #2

If not all core nodes are children of one node

Maybe we could choose a different layer of nodes as core nodes

Parsing problems

Possible issues when splitting the document: comments, PIs, CDATA

Solutions

report error

preprocessing

with „fast“ XML parser

non XML-aware

?

<ProteinEntry>...</ProteinEntry>
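To make the splitting problem concrete, here is a small sketch of a non-XML-aware splitter and of the "report error" strategy. It is an illustration under assumptions, not the pXSLT implementation: `snap_to_record_end` and `find_risky_markup` are hypothetical helpers, and the input is assumed to fit in a `bytes` object for the sake of the example.

```python
RISKY_MARKERS = (b"<!--", b"<![CDATA[", b"<?")


def snap_to_record_end(data: bytes, guess: int, record: str) -> int:
    """Move an arbitrary byte offset forward to just past the next end tag.

    This is a plain byte search, so it is fooled by an end tag that
    appears inside a comment or a CDATA section.
    """
    end_tag = f"</{record}>".encode()
    hit = data.find(end_tag, guess)
    if hit == -1:
        raise ValueError(f"no </{record}> after offset {guess}")
    return hit + len(end_tag)


def find_risky_markup(data: bytes, start: int, end: int):
    """Return (marker, offset) of the first comment, CDATA section or PI
    in data[start:end], or None if the range is safe to split blindly."""
    hits = [(data.find(m, start, end), m) for m in RISKY_MARKERS]
    hits = [(pos, m) for pos, m in hits if pos != -1]
    if not hits:
        return None
    pos, marker = min(hits)
    return marker, pos


if __name__ == "__main__":
    doc = (b'<ProteinDatabase><ProteinEntry id="1">a</ProteinEntry>'
           b'<ProteinEntry id="2"><!-- </ProteinEntry> --></ProteinEntry>'
           b'</ProteinDatabase>')
    good = snap_to_record_end(doc, 20, "ProteinEntry")
    bad = snap_to_record_end(doc, good, "ProteinEntry")
    print(doc[good - 15:good])   # a real record boundary
    print(doc[bad:bad + 4])      # the cut landed inside the comment
    print(find_risky_markup(doc, 0, len(doc)))
```

The second cut in the usage example falls inside a comment, which is exactly the case where the wrapper has to either report an error or fall back to preprocessing.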

Side-effect problem

Parallelization can produce unexpected results

Side-effects defined by the language, e.g. xsl:message

Could be buffered/concatenated

Others: vendor-specific extensions

User extensions

Solutions?
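For the language-defined side effects, the buffering idea could look roughly like this. It is only a sketch: `transform` is a hypothetical callable standing in for one processor run that reports each xsl:message through a callback, and nothing here is Saxon's actual API.

```python
from concurrent.futures import ThreadPoolExecutor


def run_buffered(portions, transform, workers=4):
    """Run `transform(portion, emit)` for each portion in parallel.

    Each run gets a private message buffer, so messages from different
    threads never interleave. Buffers are concatenated in portion order.
    """
    def job(portion):
        buffer = []                       # private to this portion
        output = transform(portion, buffer.append)
        return output, buffer

    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(job, portions))   # map keeps input order

    outputs = [out for out, _ in results]
    messages = [msg for _, buf in results for msg in buf]
    return outputs, messages


if __name__ == "__main__":
    def fake_transform(portion, emit):
        emit(f"start {portion}")
        emit(f"end {portion}")
        return portion.upper()

    outs, msgs = run_buffered(["a", "b", "c"], fake_transform)
    print(outs)
    print(msgs)
```

The messages come out in the order a single-threaded run would have produced them, at the cost of delaying them until all portions are done. Vendor and user extensions with other side effects are not covered by this.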

Experimental implementation

Thin wrapper around Saxon EE 9.6, written in Java

1. Split the document into portions of roughly the same size

2. Turn each portion into a well-formed XML document (by adding a small prefix/suffix)

3. Run an instance of Saxon on each portion

4. Merge the results when all threads finish

https://github.com/j-maly/pXSLT
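The four steps can be sketched end to end as follows. This shows the data flow only and is not the code from the repository: `process` is a stand-in for running the XSLT processor on one portion, the prefix ignores attributes and namespace declarations on the ancestors, and the whole input is held in memory, which the real wrapper would avoid. CPython threads also would not give the CPU parallelism that the Java implementation gets.

```python
import xml.etree.ElementTree as ET
from concurrent.futures import ThreadPoolExecutor


def split_records(data: bytes, record: str, parts: int) -> list:
    """Step 1: cut the record sequence into roughly equal byte ranges,
    always cutting right after an end tag of `record`."""
    start_tag, end_tag = f"<{record}".encode(), f"</{record}>".encode()
    first, last_hit = data.find(start_tag), data.rfind(end_tag)
    if first == -1 or last_hit == -1:
        raise ValueError(f"no <{record}> records found")
    last = last_hit + len(end_tag)
    cuts = [first]
    for i in range(1, parts):
        hit = data.find(end_tag, first + (last - first) * i // parts, last)
        if hit != -1 and cuts[-1] < hit + len(end_tag) < last:
            cuts.append(hit + len(end_tag))
    cuts.append(last)
    return [data[a:b] for a, b in zip(cuts, cuts[1:])]


def wrap(portion: bytes, ancestors: list) -> bytes:
    """Step 2: add a prefix/suffix that re-opens and closes the ancestors."""
    prefix = "".join(f"<{name}>" for name in ancestors).encode()
    suffix = "".join(f"</{name}>" for name in reversed(ancestors)).encode()
    return prefix + portion + suffix


def process(portion: bytes) -> str:
    """Step 3 stand-in: a real run would invoke the XSLT processor here."""
    root = ET.fromstring(portion)
    return "".join(f"{e.get('id')}\n" for e in root.iter("ProteinEntry"))


def run(data: bytes, parts: int = 4) -> str:
    portions = [wrap(p, ["ProteinDatabase"])
                for p in split_records(data, "ProteinEntry", parts)]
    with ThreadPoolExecutor(max_workers=parts) as pool:
        outputs = list(pool.map(process, portions))   # portion order is kept
    return "".join(outputs)                           # Step 4: merge


if __name__ == "__main__":
    records = b"".join(b'<ProteinEntry id="%d">x</ProteinEntry>' % i
                       for i in range(10))
    print(run(b"<ProteinDatabase>" + records + b"</ProteinDatabase>"))
```

Because the outputs are merged in portion order, the result matches what a single pass over the records would produce, as long as the records really are independent.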

Use Case

RUIAN = DB of geographical and municipal information, in XML; Prague = 614 MB of data

Simple format Records for streets, buildings, …

Task: split the large file into individual records (each in one XML file)

Takes 42 minutes in Saxon EE

Conclusion

Processing in multiple threads provides measurable speed-up

Imposes additional limitations on the stylesheet and input

The described approach makes sense only for large documents (for documents that fit into memory, other solutions are already available, e.g. saxon:threads)

https://github.com/j-maly/pXSLT
