Processing large document without streaming

Processing large document without streaming

John Platt
Preface: We run our Saxon process (currently version 9.1.0.3) as part of an EAR that runs on WebSphere.

We currently have an issue where a 300MB document is contributing to out-of-memory errors because (we believe) the whole DOM structure is read into memory before processing begins. I have looked into the streaming functionality introduced in the 9.6 release, but I don't believe we can take advantage of it because of the limitations on the XPath expressions that can be used. We are currently making changes that will allow us to expand the available memory, but my fear is that we will still hit a threshold eventually.

I am wondering if anyone has any other suggestions on ways to reduce the memory impact, given that we cannot limit which pieces of the XML document are brought in: we need to be able to reference different portions at any given time.

Re: Processing large document without streaming

Michael Kay
Well, it rather depends on the processing you are doing.

Firstly, is it failing while loading/parsing the document, or later, during transformation? 300MB should be possible these days if you've got a reasonable amount of memory available. How much memory are you allocating?

Apart from streaming, two techniques worth looking at are document projection and document splitting. Document projection trims the document down to retain only the parts of it that you actually need: basically, you write an XQuery that creates the trimmed-down document and run that query with projection enabled, and Saxon builds the minimum tree needed to process the query. Document splitting is useful if your transformation can be written to operate on the large document "one element at a time".
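
To sketch the splitting idea (illustrative only: the parts/ directory and the element names are invented, and whether each part can actually be discarded after use depends on the collection-stability settings of the Saxon release in question), suppose the 300MB file has been pre-split into smaller files:

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                version="2.0">

  <!-- Entry point: invoke with Saxon's -it:main option, so no single
       large source document is ever loaded. -->
  <xsl:template name="main">
    <results>
      <!-- Saxon resolves this collection URI to every XML file in parts/ -->
      <xsl:for-each select="collection('parts/?select=*.xml')">
        <xsl:apply-templates select="*"/>
      </xsl:for-each>
    </results>
  </xsl:template>

  <!-- Per-part processing: only the current part's tree is needed here -->
  <xsl:template match="order">
    <summary id="{@id}" total="{sum(item/@price)}"/>
  </xsl:template>

</xsl:stylesheet>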

The other suggestion I would make is to look at the transformation to see if it is creating any large intermediate data structures, typically held in variables. Sometimes, for example, people create unnecessary copies of nodes when they could just as well work with references to the original nodes.
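
To make the copy-versus-reference point concrete (a sketch; the element names are invented):

<!-- Builds a fresh copy of every matching subtree inside the variable:
     extra memory on top of the source tree. -->
<xsl:variable name="active-orders">
  <xsl:copy-of select="//order[@status = 'active']"/>
</xsl:variable>

<!-- Binds references to the original nodes instead: nothing is copied,
     and keys and the ancestor axis on the source document still work. -->
<xsl:variable name="active-orders" as="element(order)*"
              select="//order[@status = 'active']"/>

The second form is usually what was wanted anyway, since the copies made by the first form lose their ancestors.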


Michael Kay
Saxonica
[hidden email]
+44 (0) 118 946 5893

Re: Processing large document without streaming

Ihe Onwuka-2
I had this problem when I was trying to find the best match for something; that required searching the entire document, which led to heap-space errors.

My solution was to split the document into roughly equal parts, find the best match within each part, and then find the best match of the best matches.
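
In outline, the XSLT 2.0 shape of it was something like this (a sketch, not my production code: record, @score, the f prefix, and the parts/ layout are all stand-ins):

<!-- assumes xmlns:f="urn:local-functions" declared on xsl:stylesheet -->
<xsl:function name="f:best" as="element(record)?">
  <xsl:param name="candidates" as="element(record)*"/>
  <xsl:variable name="sorted" as="element(record)*">
    <!-- perform-sort with @select reorders references; it copies no nodes -->
    <xsl:perform-sort select="$candidates">
      <xsl:sort select="@score" data-type="number" order="descending"/>
    </xsl:perform-sort>
  </xsl:variable>
  <xsl:sequence select="$sorted[1]"/>
</xsl:function>

<!-- best match within each part, then the best of those bests -->
<xsl:variable name="winner" as="element(record)?"
  select="f:best(for $part in collection('parts/?select=*.xml')
                 return f:best($part//record))"/>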

Dr Kay mentioned to me in the past that the limitation is on the number of nodes, not necessarily the file size, so another option would be to make a lossless (i.e. reversible) denormalizing transformation to run against: for example, attributes to semicolon-separated key-value pairs, a sequence of elements to one element with semicolon-separated content, etc.
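
For instance, a sketch of the attribute-folding version (names invented; for it to stay truly reversible you would also need to escape any ';' or '=' occurring in the values):

<!-- Fold each element's attributes into a single "name=value" text
     child, so that N attribute nodes become one text node. -->
<xsl:template match="*[@*]">
  <xsl:copy>
    <xsl:value-of select="string-join(for $a in @*
                          return concat(name($a), '=', string($a)), ';')"/>
    <xsl:apply-templates/>
  </xsl:copy>
</xsl:template>

<!-- identity for everything else -->
<xsl:template match="*">
  <xsl:copy>
    <xsl:apply-templates/>
  </xsl:copy>
</xsl:template>

The saving shows up when the denormalized file is re-loaded for the real processing: fewer nodes in the tree for the same data.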

Another route I have taken is to pre-process the file with an awk script (or other streamable friends) in a way that mitigates the problem.


Re: Processing large document without streaming

David Rudel
In reply to this post by John Platt
John,
How much RAM do you have available?
I process large documents very frequently, sometimes using streaming and sometimes without, and I don't think 300MB should be too much to handle in and of itself for most modern computers. I just ran a test of a script that processes a 400MB document, allocating 4GB of RAM, and it did not have problems.

A few possibilities:
*) Your XSLT may be written in a way that makes the internal processing memory-inefficient.
**) There may be significant improvements between 9.1 and the current version that I don't know of. Perhaps try running the same script with one of the newer incarnations of Saxon?

I'd be happy to take a look at your XSL script if you want.

-David
--

"A false conclusion, once arrived at and widely accepted is not
dislodged easily, and the less it is understood, the more tenaciously
it is held." - Cantor's Law of Preservation of Ignorance.


Re: Processing large document without streaming

Michael Kay
In reply to this post by Ihe Onwuka-2
>
> Dr Kay mentioned to me in the past that the limitation is on the number of nodes not necessarily the file size

You're not going to hit a limitation on the number of nodes, or on the size of their string value, with a 300MB file. You should be able to get to at least 2GB before you start hitting limits caused by 32-bit sizes and offsets.

Michael Kay
Saxonica

Re: Processing large document without streaming

Ihe Onwuka-2
I have definitely run into heap-space errors with files under 300MB (in fact, as I recall, under 200MB); perhaps it's the machine configuration I'm using.

The larger point, though, is that there is some threshold beyond which some sort of size-related countermeasures become necessary.



Re: Processing large document without streaming

John Platt
In reply to this post by Michael Kay
Our highest level of RAM is 1024MB. We allow parallel processing through our service and are currently hampered by 32-bit RAM limitations (we are working on that second part). Our service is also not the only process running, so the maximum RAM can be lower (without parallelism in that instance).

I can look into projection and document splitting, but I fear that because of the setup of our document we would run into issues there as well. Due to a lot of repeated data, we already have the XML document broken into chunks to limit the markup; because of that, though, we have to reference those different trees a lot, which would make a proper path for projection difficult. In a similar way, document splitting would be difficult because we have a lot of duplicate-exclusion logic and a need to sort the entire set of data for display.

I know at one point we made a change away from copy-of to the use of sequences when we process subsets of collections of data, but we have to have those subset groups.

I plan to look at the newest version of Saxon for sure, to take advantage of any processing gains, but what will my next cap be even if the 300MB file processes fine?

Re: Processing large document without streaming

David Rudel
Any chance your goals can be accomplished via XQuery rather than XSLT?




--

"A false conclusion, once arrived at and widely accepted is not
dislodged easily, and the less it is understood, the more tenaciously
it is held." - Cantor's Law of Preservation of Ignorance.


Re: Processing large document without streaming

John Platt
No, XQuery isn't an option. We use a lot of functionality that is only provided in XSLT.