Splitting large ONIX files using XSLT

When processing very large ONIX files – which may contain thousands or tens of thousands of records – it is possible to exceed the memory available on your computer, particularly when the entire file is parsed and a complete XML document tree is held in memory. There are a number of solutions (aside from allocating more memory…) that process the file as a stream of records, avoiding the need to store a complete document and so reducing the memory requirement, but they tend to be complex.
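To illustrate the stream-of-records approach mentioned above, here is a minimal Python sketch. It is not part of the EDItEUR tool, and the function names (count_products, local_name) and the sample message are assumptions made purely for illustration. It uses the standard library's incremental parser to visit each record as it is completed and then discards it, so the full document tree is never held in memory. Records are matched by local name so that both Reference names (Product) and Short tags (product) are recognized, with or without a namespace.

```python
import io
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip any '{namespace}' prefix from an ElementTree tag name."""
    return tag.rsplit("}", 1)[-1]


def count_products(source):
    """Count product records without building the complete document tree."""
    count = 0
    context = ET.iterparse(source, events=("start", "end"))
    _, root = next(context)  # the first event is the start of the root element
    for event, elem in context:
        if event == "end" and local_name(elem.tag) in ("Product", "product"):
            count += 1    # real per-record processing would go here
            root.clear()  # discard the records (and header) parsed so far
    return count


# Hypothetical, heavily abbreviated ONIX message used only as test input.
sample = b"""<ONIXMessage release="3.0">
  <Header><Sender><SenderName>Example</SenderName></Sender></Header>
  <Product><RecordReference>a</RecordReference></Product>
  <Product><RecordReference>b</RecordReference></Product>
  <Product><RecordReference>c</RecordReference></Product>
</ONIXMessage>"""

print(count_products(io.BytesIO(sample)))  # prints 3
```

Even in this simple form, the code has to manage parser events and release memory explicitly, which hints at why streaming solutions grow complex once real processing is added for each record.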

As a quick solution, a large ONIX file can be split into smaller fragments, each of which can be processed individually. This script (download it here) uses XSLT to split a single large ONIX file into many smaller ONIX files.

It works with both ONIX 2.1 and ONIX 3.0, and with both the Reference name and Short tag flavors of each. It has been tested with the Saxon 9.3 XSLT processor.

Usage notes

A typical command-line invocation to split a file called ‘myonixfile.xml’:

java -jar {path-to-saxon} myonixfile.xml split-big-onix-file.xsl size="200" basename="myonixfile-split"

Parameter details:

The XSLT can also be used within an XML editing tool such as oXygen. Note that it uses XSLT 2.0, and is not suitable for use with earlier versions of XSLT.

The XSLT file splitter is provided ‘as is’ and is not warranted in any way by EDItEUR. EDItEUR provides no technical support for this tool other than this document. EDItEUR disclaims all responsibility for any errors and for all consequential inconvenience or cost associated with its use. Bug reports and other enquiries may be sent to EDItEUR (info@editeur.org) or to the original author Francis Cave (francis@franciscave.com).