Overhead of xs:integer and managing memory of large maps

classic Classic list List threaded Threaded
5 messages Options
Reply | Threaded
Open this post in threaded view
|

Overhead of xs:integer and managing memory of large maps

David Rudel
Hi all,
I have written a data-mining script that requires the use of several maps, each of which having hundreds of thousands or millions of keys. Each key is either an integer or string. Each value is theoretically a xs:double, being the distance between nodes in a multi-dimensional space.

I'm having trouble with memory and am looking for ways to cut down the amount of space required to hold the map data. I'm writing mainly to ask about various data types I might use to hold the distance value. I only really care about the first 3 or 4 digits of this value.

Does the saxon implementation support arbitrarily sized integers? if so, how much overhead is there in storing something as an xs:integer rather than as some smaller type?

I'm currently trying to use xs:unsignedShort, as I have a sense of the upper bound on the distances and so I can convert each distance to an integer by appropriate scaling.

I could also use xs:float, but that should use twice as much space (but half as much as xs:double).

I'm also considering storing the *keys* of the map as something other than xs:integer. Should I expect much of an improvement by doing so?

Also, my program uses xsl:iterate, and each iteration a new set of maps is created. These new maps are generally refinements of the previous set of maps (some entries get added, some others get purged to save space). 

I'm currently using map:new($old.map,$new.entries) to add in the new entries, but I am then creating a whole new map to purge the ones I don't need. Would it have a smaller memory footprint to use map:remove() instead? I don't know how practical it is use map:remove() when you may have to be removing 50,000 entries. I recall some discussion of a major improvement in how map:remove works in 9.5.1.6.

What if I created the new map all in one line:
map:remove(map:new($old.map,$new.entries),$purged.keys)? (map:remove only takes away one entry at a time, but I could use fold-left to accomplish multiple removals in one single command I assume.)

Any advice is appreciated.

Note: I'm using saxon EE 9.5.1.7

Are there any relevant improvements in 9.6? I assume oXygen will be sending out a release incorporating that version soonish.

-David




--

"A false conclusion, once arrived at and widely accepted is not dislodged easily, and the less it is understood, the more tenaciously it is held." - Cantor's Law of Preservation of Ignorance.

------------------------------------------------------------------------------
Download BIRT iHub F-Type - The Free Enterprise-Grade BIRT Server
from Actuate! Instantly Supercharge Your Business Reports and Dashboards
with Interactivity, Sharing, Native Excel Exports, App Integration & more
Get technology previously reserved for billion-dollar corporations, FREE
http://pubads.g.doubleclick.net/gampad/clk?id=164703151&iu=/4140/ostg.clktrk
_______________________________________________
saxon-help mailing list archived at http://saxon.markmail.org/
[hidden email]
https://lists.sourceforge.net/lists/listinfo/saxon-help 
Reply | Threaded
Open this post in threaded view
|

Re: Overhead of xs:integer and managing memory of large maps

Dimitre Novatchev
On Sat, Dec 6, 2014 at 7:47 AM, David Rudel <[hidden email]> wrote:

> Also, my program uses xsl:iterate, and each iteration a new set of maps is
> created. These new maps are generally refinements of the previous set of
> maps (some entries get added, some others get purged to save space).
>
> I'm currently using map:new($old.map,$new.entries) to add in the new
> entries, but I am then creating a whole new map to purge the ones I don't
> need. Would it have a smaller memory footprint to use map:remove() instead?
> I don't know how practical it is use map:remove() when you may have to be
> removing 50,000 entries. I recall some discussion of a major improvement in
> how map:remove works in 9.5.1.6.


 I don't know how <xsl:iterate> is implemented.

If one uses tail-recursion instead (and AFAIK Saxon recognizes and
optimizes tail calls when the same-named template or function is
called), the references to memory that doesn't get passed in the call
as parameters, should be gone once the call is performed and this
should make all such memory reclaimable by the garbage collector. This
could ensure that the memory of any "old" maps is freed.

Once again, I don't know how Saxon implements tail recursion -- if it
uses lazy evaluation, so that the parameters when passed in the tail
call are still not evaluated, this would prevent freeing any memory
involved in that evaluation. I am not aware if Saxon provides any
language constructs or options for requesting eager vs. lazy
evaluation.

Cheers,
Dimitre

------------------------------------------------------------------------------
Download BIRT iHub F-Type - The Free Enterprise-Grade BIRT Server
from Actuate! Instantly Supercharge Your Business Reports and Dashboards
with Interactivity, Sharing, Native Excel Exports, App Integration & more
Get technology previously reserved for billion-dollar corporations, FREE
http://pubads.g.doubleclick.net/gampad/clk?id=164703151&iu=/4140/ostg.clktrk
_______________________________________________
saxon-help mailing list archived at http://saxon.markmail.org/
[hidden email]
https://lists.sourceforge.net/lists/listinfo/saxon-help 
Reply | Threaded
Open this post in threaded view
|

Re: Overhead of xs:integer and managing memory of large maps

Michael Kay
In reply to this post by David Rudel

> Hi all,
> I have written a data-mining script that requires the use of several maps, each of which having hundreds of thousands or millions of keys. Each key is either an integer or string. Each value is theoretically a xs:double, being the distance between nodes in a multi-dimensional space.
>
> I'm having trouble with memory and am looking for ways to cut down the amount of space required to hold the map data. I'm writing mainly to ask about various data types I might use to hold the distance value. I only really care about the first 3 or 4 digits of this value.
>
> Does the saxon implementation support arbitrarily sized integers? if so, how much overhead is there in storing something as an xs:integer rather than as some smaller type?

You may find it worthwhile looking at the code if you want to understand internal details of the implementation. An xs:integer is represented by the Java abstract class IntegerValue, which has two concrete implementation subclasses, Int64Value and BigIntegerValue. An Int64Value wraps a Java long, while a BigIntegerValue wraps a Java BigInteger. For an integer written as a literal, an Int64Value will be used if the value is within the range of a long. For integers that result from computation, an Int64Value is generally used if the system can readily determine that the result of the computation will be in the range of a long. If your values are always small integers, then storing them as xs:float would be more compact. Using types such as xs:unsignedShort will not help; these use the same implementation classes as xs:integer. For any AtomicValue, there are at least two objects: the AtomicValue contains the Java object representing the value (e.g. a String or a BigInteger) and also a reference to an AtomicType object holding the type label.

Some months ago a user reported a problem with the size of the indexes used to do validation of XSD key/keyref constraints, and I implemented an optimization with gave a dramatic saving for this case using a class called CompactStringValue instead of the general-purpose StringValue: instead of holding a CharSequence plus type label, this simply holds a char[] array. That optimization is only available for this particular case, but it does illustrate that there is potential for improvements of this kind, for example in cases where the keys of a map all have the same data type.

>
> I'm currently trying to use xs:unsignedShort, as I have a sense of the upper bound on the distances and so I can convert each distance to an integer by appropriate scaling.
>
> I could also use xs:float, but that should use twice as much space (but half as much as xs:double).
>
> I'm also considering storing the *keys* of the map as something other than xs:integer. Should I expect much of an improvement by doing so?
>
> Also, my program uses xsl:iterate, and each iteration a new set of maps is created. These new maps are generally refinements of the previous set of maps (some entries get added, some others get purged to save space).
>
> I'm currently using map:new($old.map,$new.entries) to add in the new entries, but I am then creating a whole new map to purge the ones I don't need. Would it have a smaller memory footprint to use map:remove() instead? I don't know how practical it is use map:remove() when you may have to be removing 50,000 entries. I recall some discussion of a major improvement in how map:remove works in 9.5.1.6.

The implementation of maps changes completely in Saxon 9.6, to use a hash trie implementation. The most notable change is that the performance of map:remove is dramatically improved. In 9.5, map:remove is very expensive.

Michael Kay
Saxonica

>
> What if I created the new map all in one line:
> map:remove(map:new($old.map,$new.entries),$purged.keys)? (map:remove only takes away one entry at a time, but I could use fold-left to accomplish multiple removals in one single command I assume.)
>
> Any advice is appreciated.
>
> Note: I'm using saxon EE 9.5.1.7
>
> Are there any relevant improvements in 9.6? I assume oXygen will be sending out a release incorporating that version soonish.
>
> -David
>
>
>
>
> --
>
> "A false conclusion, once arrived at and widely accepted is not dislodged easily, and the less it is understood, the more tenaciously it is held." - Cantor's Law of Preservation of Ignorance.
> ------------------------------------------------------------------------------
> Download BIRT iHub F-Type - The Free Enterprise-Grade BIRT Server
> from Actuate! Instantly Supercharge Your Business Reports and Dashboards
> with Interactivity, Sharing, Native Excel Exports, App Integration & more
> Get technology previously reserved for billion-dollar corporations, FREE
> http://pubads.g.doubleclick.net/gampad/clk?id=164703151&iu=/4140/ostg.clktrk_______________________________________________
> saxon-help mailing list archived at http://saxon.markmail.org/
> [hidden email]
> https://lists.sourceforge.net/lists/listinfo/saxon-help


------------------------------------------------------------------------------
Download BIRT iHub F-Type - The Free Enterprise-Grade BIRT Server
from Actuate! Instantly Supercharge Your Business Reports and Dashboards
with Interactivity, Sharing, Native Excel Exports, App Integration & more
Get technology previously reserved for billion-dollar corporations, FREE
http://pubads.g.doubleclick.net/gampad/clk?id=164703151&iu=/4140/ostg.clktrk
_______________________________________________
saxon-help mailing list archived at http://saxon.markmail.org/
[hidden email]
https://lists.sourceforge.net/lists/listinfo/saxon-help 
Reply | Threaded
Open this post in threaded view
|

Re: Overhead of xs:integer and managing memory of large maps

David Rudel
Thanks, Mike! 
I'll convert from xs:unsignedShort to xs:float.

In your book you indicate that there is little point to these minor integer datatypes, but I thought that perhaps my particular case might have been an exception. Looks like that is not the case.

One last question: you mentioned that in 9.5 that map:remove was very expensive, what about the situation where there are duplicate keys among maps that are being merged using map:new() [now map:merge()]?

-David



On Sun, Dec 7, 2014 at 9:49 PM, Michael Kay <[hidden email]> wrote:

> Hi all,
> I have written a data-mining script that requires the use of several maps, each of which having hundreds of thousands or millions of keys. Each key is either an integer or string. Each value is theoretically a xs:double, being the distance between nodes in a multi-dimensional space.
>
> I'm having trouble with memory and am looking for ways to cut down the amount of space required to hold the map data. I'm writing mainly to ask about various data types I might use to hold the distance value. I only really care about the first 3 or 4 digits of this value.
>
> Does the saxon implementation support arbitrarily sized integers? if so, how much overhead is there in storing something as an xs:integer rather than as some smaller type?

You may find it worthwhile looking at the code if you want to understand internal details of the implementation. An xs:integer is represented by the Java abstract class IntegerValue, which has two concrete implementation subclasses, Int64Value and BigIntegerValue. An Int64Value wraps a Java long, while a BigIntegerValue wraps a Java BigInteger. For an integer written as a literal, an Int64Value will be used if the value is within the range of a long. For integers that result from computation, an Int64Value is generally used if the system can readily determine that the result of the computation will be in the range of a long. If your values are always small integers, then storing them as xs:float would be more compact. Using types such as xs:unsignedShort will not help; these use the same implementation classes as xs:integer. For any AtomicValue, there are at least two objects: the AtomicValue contains the Java object representing the value (e.g. a String or a BigInteger) and also a reference to an AtomicType object holding the type label.

Some months ago a user reported a problem with the size of the indexes used to do validation of XSD key/keyref constraints, and I implemented an optimization with gave a dramatic saving for this case using a class called CompactStringValue instead of the general-purpose StringValue: instead of holding a CharSequence plus type label, this simply holds a char[] array. That optimization is only available for this particular case, but it does illustrate that there is potential for improvements of this kind, for example in cases where the keys of a map all have the same data type.
>
> I'm currently trying to use xs:unsignedShort, as I have a sense of the upper bound on the distances and so I can convert each distance to an integer by appropriate scaling.
>
> I could also use xs:float, but that should use twice as much space (but half as much as xs:double).
>
> I'm also considering storing the *keys* of the map as something other than xs:integer. Should I expect much of an improvement by doing so?
>
> Also, my program uses xsl:iterate, and each iteration a new set of maps is created. These new maps are generally refinements of the previous set of maps (some entries get added, some others get purged to save space).
>
> I'm currently using map:new($old.map,$new.entries) to add in the new entries, but I am then creating a whole new map to purge the ones I don't need. Would it have a smaller memory footprint to use map:remove() instead? I don't know how practical it is use map:remove() when you may have to be removing 50,000 entries. I recall some discussion of a major improvement in how map:remove works in 9.5.1.6.

The implementation of maps changes completely in Saxon 9.6, to use a hash trie implementation. The most notable change is that the performance of map:remove is dramatically improved. In 9.5, map:remove is very expensive.

Michael Kay
Saxonica

>
> What if I created the new map all in one line:
> map:remove(map:new($old.map,$new.entries),$purged.keys)? (map:remove only takes away one entry at a time, but I could use fold-left to accomplish multiple removals in one single command I assume.)
>
> Any advice is appreciated.
>
> Note: I'm using saxon EE 9.5.1.7
>
> Are there any relevant improvements in 9.6? I assume oXygen will be sending out a release incorporating that version soonish.
>
> -David
>
>
>
>
> --
>
> "A false conclusion, once arrived at and widely accepted is not dislodged easily, and the less it is understood, the more tenaciously it is held." - Cantor's Law of Preservation of Ignorance.
> ------------------------------------------------------------------------------
> Download BIRT iHub F-Type - The Free Enterprise-Grade BIRT Server
> from Actuate! Instantly Supercharge Your Business Reports and Dashboards
> with Interactivity, Sharing, Native Excel Exports, App Integration & more
> Get technology previously reserved for billion-dollar corporations, FREE
> http://pubads.g.doubleclick.net/gampad/clk?id=164703151&iu=/4140/ostg.clktrk_______________________________________________
> saxon-help mailing list archived at http://saxon.markmail.org/
> [hidden email]
> https://lists.sourceforge.net/lists/listinfo/saxon-help


------------------------------------------------------------------------------
Download BIRT iHub F-Type - The Free Enterprise-Grade BIRT Server
from Actuate! Instantly Supercharge Your Business Reports and Dashboards
with Interactivity, Sharing, Native Excel Exports, App Integration & more
Get technology previously reserved for billion-dollar corporations, FREE
http://pubads.g.doubleclick.net/gampad/clk?id=164703151&iu=/4140/ostg.clktrk
_______________________________________________
saxon-help mailing list archived at http://saxon.markmail.org/
[hidden email]
https://lists.sourceforge.net/lists/listinfo/saxon-help



--

"A false conclusion, once arrived at and widely accepted is not dislodged easily, and the less it is understood, the more tenaciously it is held." - Cantor's Law of Preservation of Ignorance.

------------------------------------------------------------------------------
Download BIRT iHub F-Type - The Free Enterprise-Grade BIRT Server
from Actuate! Instantly Supercharge Your Business Reports and Dashboards
with Interactivity, Sharing, Native Excel Exports, App Integration & more
Get technology previously reserved for billion-dollar corporations, FREE
http://pubads.g.doubleclick.net/gampad/clk?id=164703151&iu=/4140/ostg.clktrk
_______________________________________________
saxon-help mailing list archived at http://saxon.markmail.org/
[hidden email]
https://lists.sourceforge.net/lists/listinfo/saxon-help 
Reply | Threaded
Open this post in threaded view
|

Re: Overhead of xs:integer and managing memory of large maps

Michael Kay
I think that map:merge()/map:new() in both 9.5 and 9.6 has performance proportional to the total size of the input maps excluding the first. I don't think duplicate keys should impact this much.

Michael Kay
Saxonica
+44 (0) 118 946 5893




On 7 Dec 2014, at 21:54, David Rudel <[hidden email]> wrote:

Thanks, Mike! 
I'll convert from xs:unsignedShort to xs:float.

In your book you indicate that there is little point to these minor integer datatypes, but I thought that perhaps my particular case might have been an exception. Looks like that is not the case.

One last question: you mentioned that in 9.5 that map:remove was very expensive, what about the situation where there are duplicate keys among maps that are being merged using map:new() [now map:merge()]?

-David



On Sun, Dec 7, 2014 at 9:49 PM, Michael Kay <[hidden email]> wrote:

> Hi all,
> I have written a data-mining script that requires the use of several maps, each of which having hundreds of thousands or millions of keys. Each key is either an integer or string. Each value is theoretically a xs:double, being the distance between nodes in a multi-dimensional space.
>
> I'm having trouble with memory and am looking for ways to cut down the amount of space required to hold the map data. I'm writing mainly to ask about various data types I might use to hold the distance value. I only really care about the first 3 or 4 digits of this value.
>
> Does the saxon implementation support arbitrarily sized integers? if so, how much overhead is there in storing something as an xs:integer rather than as some smaller type?

You may find it worthwhile looking at the code if you want to understand internal details of the implementation. An xs:integer is represented by the Java abstract class IntegerValue, which has two concrete implementation subclasses, Int64Value and BigIntegerValue. An Int64Value wraps a Java long, while a BigIntegerValue wraps a Java BigInteger. For an integer written as a literal, an Int64Value will be used if the value is within the range of a long. For integers that result from computation, an Int64Value is generally used if the system can readily determine that the result of the computation will be in the range of a long. If your values are always small integers, then storing them as xs:float would be more compact. Using types such as xs:unsignedShort will not help; these use the same implementation classes as xs:integer. For any AtomicValue, there are at least two objects: the AtomicValue contains the Java object representing the value (e.g. a String or a BigInteger) and also a reference to an AtomicType object holding the type label.

Some months ago a user reported a problem with the size of the indexes used to do validation of XSD key/keyref constraints, and I implemented an optimization with gave a dramatic saving for this case using a class called CompactStringValue instead of the general-purpose StringValue: instead of holding a CharSequence plus type label, this simply holds a char[] array. That optimization is only available for this particular case, but it does illustrate that there is potential for improvements of this kind, for example in cases where the keys of a map all have the same data type.
>
> I'm currently trying to use xs:unsignedShort, as I have a sense of the upper bound on the distances and so I can convert each distance to an integer by appropriate scaling.
>
> I could also use xs:float, but that should use twice as much space (but half as much as xs:double).
>
> I'm also considering storing the *keys* of the map as something other than xs:integer. Should I expect much of an improvement by doing so?
>
> Also, my program uses xsl:iterate, and each iteration a new set of maps is created. These new maps are generally refinements of the previous set of maps (some entries get added, some others get purged to save space).
>
> I'm currently using map:new($old.map,$new.entries) to add in the new entries, but I am then creating a whole new map to purge the ones I don't need. Would it have a smaller memory footprint to use map:remove() instead? I don't know how practical it is use map:remove() when you may have to be removing 50,000 entries. I recall some discussion of a major improvement in how map:remove works in 9.5.1.6.

The implementation of maps changes completely in Saxon 9.6, to use a hash trie implementation. The most notable change is that the performance of map:remove is dramatically improved. In 9.5, map:remove is very expensive.

Michael Kay
Saxonica

>
> What if I created the new map all in one line:
> map:remove(map:new($old.map,$new.entries),$purged.keys)? (map:remove only takes away one entry at a time, but I could use fold-left to accomplish multiple removals in one single command I assume.)
>
> Any advice is appreciated.
>
> Note: I'm using saxon EE 9.5.1.7
>
> Are there any relevant improvements in 9.6? I assume oXygen will be sending out a release incorporating that version soonish.
>
> -David
>
>
>
>
> --
>
> "A false conclusion, once arrived at and widely accepted is not dislodged easily, and the less it is understood, the more tenaciously it is held." - Cantor's Law of Preservation of Ignorance.
> ------------------------------------------------------------------------------
> Download BIRT iHub F-Type - The Free Enterprise-Grade BIRT Server
> from Actuate! Instantly Supercharge Your Business Reports and Dashboards
> with Interactivity, Sharing, Native Excel Exports, App Integration & more
> Get technology previously reserved for billion-dollar corporations, FREE
> http://pubads.g.doubleclick.net/gampad/clk?id=164703151&iu=/4140/ostg.clktrk_______________________________________________
> saxon-help mailing list archived at http://saxon.markmail.org/
> [hidden email]
> https://lists.sourceforge.net/lists/listinfo/saxon-help


------------------------------------------------------------------------------
Download BIRT iHub F-Type - The Free Enterprise-Grade BIRT Server
from Actuate! Instantly Supercharge Your Business Reports and Dashboards
with Interactivity, Sharing, Native Excel Exports, App Integration & more
Get technology previously reserved for billion-dollar corporations, FREE
http://pubads.g.doubleclick.net/gampad/clk?id=164703151&iu=/4140/ostg.clktrk
_______________________________________________
saxon-help mailing list archived at http://saxon.markmail.org/
[hidden email]
https://lists.sourceforge.net/lists/listinfo/saxon-help



--

"A false conclusion, once arrived at and widely accepted is not dislodged easily, and the less it is understood, the more tenaciously it is held." - Cantor's Law of Preservation of Ignorance.
------------------------------------------------------------------------------
Download BIRT iHub F-Type - The Free Enterprise-Grade BIRT Server
from Actuate! Instantly Supercharge Your Business Reports and Dashboards
with Interactivity, Sharing, Native Excel Exports, App Integration & more
Get technology previously reserved for billion-dollar corporations, FREE
http://pubads.g.doubleclick.net/gampad/clk?id=164703151&iu=/4140/ostg.clktrk_______________________________________________
saxon-help mailing list archived at http://saxon.markmail.org/
[hidden email]
https://lists.sourceforge.net/lists/listinfo/saxon-help


------------------------------------------------------------------------------
Download BIRT iHub F-Type - The Free Enterprise-Grade BIRT Server
from Actuate! Instantly Supercharge Your Business Reports and Dashboards
with Interactivity, Sharing, Native Excel Exports, App Integration & more
Get technology previously reserved for billion-dollar corporations, FREE
http://pubads.g.doubleclick.net/gampad/clk?id=164703151&iu=/4140/ostg.clktrk
_______________________________________________
saxon-help mailing list archived at http://saxon.markmail.org/
[hidden email]
https://lists.sourceforge.net/lists/listinfo/saxon-help