regex changing \s to \s+ causes out of memory error

regex changing \s to \s+ causes out of memory error

Ihe Onwuka-2
Ok, this is probably not going to be that helpful, but

<xsl:for-each select="tokenize(replace(unparsed-text('sortedAKA.list'),'\n\s','&#x9;','m'),'\n')">

etc 

works

and 


<xsl:for-each select="tokenize(replace(unparsed-text('sortedAKA.list'),'\n\s+','&#x9;','m'),'\n')">

where the \s is changed to \s+

gives a heap space error. 

I expect I'll be asked for the rest of the stylesheet and some data, but I posted first in case it's a known problem.

Running Saxon 9.5.


_______________________________________________
saxon-help mailing list archived at http://saxon.markmail.org/
[hidden email]
https://lists.sourceforge.net/lists/listinfo/saxon-help 
Re: regex changing \s to \s+ causes out of memory error

Ihe Onwuka-2
Actually the stylesheet isn't that long 

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet
        xmlns:xs="http://www.w3.org/2001/XMLSchema"
        xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
        exclude-result-prefixes="xs" version="2.0">

  <xsl:strip-space elements="*"/>
  <xsl:output indent="yes" method="xml" omit-xml-declaration="yes"/>

  <xsl:template name="main">
    <movies>
      <!-- Aliases are space-indented, one per line, under their title. Change them to be tab-separated on the same line as the title. -->
      <xsl:for-each select="tokenize(replace(unparsed-text('sortedAKA.list'),'\n\s','&#x9;','m'),'\n')">
         <xsl:element name="{if (starts-with(.,'&quot;')) then 'tv' else 'movie'}">
           <xsl:variable name="aliases">
             <xsl:for-each select="tokenize(.,'\t')">
               <alias><xsl:value-of select="."/></alias>
             </xsl:for-each>
           </xsl:variable>
           <xsl:apply-templates select="$aliases"/>
         </xsl:element>
      </xsl:for-each>
    </movies>
  </xsl:template>

  <xsl:template match="alias[1]" name="alias">
    <xsl:attribute name="title" select="tokenize(.,'^\s*&#x22;|&#x22;\s*\(.+$')[2]"/>
    <xsl:attribute name="year" select="tokenize(.,'[()]')[last() - 1]"/>
  </xsl:template>

  <xsl:template match="alias">
    <xsl:copy>
      <xsl:call-template name="alias"/>
    </xsl:copy>
  </xsl:template>
  
</xsl:stylesheet>


The data looks like this

"$#*! My Dad Says" (2010)
   "Beep My Dad Says" (2010)
   "Shit My Dad Says" (2010)
   "Shit! My Dad Says" (2013)
"$40 a Day" (2002)
   "Forty Dollars a Day" (2002)

and there are 591k lines of it.
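
For anyone following along, here is a rough Python sketch of what the replace/tokenize pair is meant to do to the sample above. It is only an illustration: Python's `re` module is not the Saxon regex engine, so it says nothing about the heap space error, and `fold_aliases` is a name made up for this example.

```python
import re

# Sample lines copied from the post above.
SAMPLE = "\n".join([
    '"$#*! My Dad Says" (2010)',
    '   "Beep My Dad Says" (2010)',
    '   "Shit My Dad Says" (2010)',
    '   "Shit! My Dad Says" (2013)',
    '"$40 a Day" (2002)',
    '   "Forty Dollars a Day" (2002)',
])


def fold_aliases(text, pattern):
    """Join indented alias lines onto their title line, tab separated.

    Rough analogue of tokenize(replace(text, pattern, '&#x9;'), '\n'),
    followed by the inner tokenize(., '\t').
    """
    # A newline followed by whitespace marks an alias line: turn it into a tab.
    folded = re.sub(pattern, "\t", text)
    # What is left has one title per line, with its aliases after tabs.
    return [record.split("\t") for record in folded.split("\n")]


one_space = fold_aliases(SAMPLE, r"\n\s")    # the variant that works
all_spaces = fold_aliases(SAMPLE, r"\n\s+")  # the variant that runs out of heap in 9.5
```

On this sample both patterns give the same two records. The only difference is that `\n\s` leaves the remaining two spaces of indentation in front of each alias, which the `^\s*` in the title regex then absorbs.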

Re: regex changing \s to \s+ causes out of memory error

Michael Kay
With the 9.5 regex engine (derived from Apache Jakarta) backtracking on a longish input text can be very expensive. This is greatly improved in 9.6.

I'm a little surprised that this should apply to these simple regular expressions - but I'd be grateful if you could see if the problem goes away with 9.6.


Michael Kay
Saxonica
+44 (0) 118 946 5893
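
Since the cost described above comes from running one regex over the whole input text, one possible workaround is to split the text into lines first and decide line by line whether each one is an alias. This was not proposed in the thread; it is a minimal Python sketch of the idea under that assumption, and `fold_by_line` is a hypothetical name. The same shape could probably be expressed in XSLT by tokenizing on the newline first and grouping the resulting lines.

```python
def fold_by_line(lines):
    """Group indented alias lines under the preceding title line.

    No regex is applied to the full text, so there is nothing to
    backtrack over; each line is inspected exactly once.
    """
    records = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if not line.strip():
            continue  # ignore blank lines
        if line[0].isspace() and records:
            # Indented line: an alias of the most recent title.
            records[-1].append(line.strip())
        else:
            # Unindented line: starts a new title record.
            records.append([line.strip()])
    return records


# Sample lines copied from the data shown earlier in the thread.
sample_lines = [
    '"$40 a Day" (2002)\n',
    '   "Forty Dollars a Day" (2002)\n',
]
records = fold_by_line(sample_lines)
```

For a real file the function can be fed an open file object directly, so the 591k lines never need to be held as a single string.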




Re: regex changing \s to \s+ causes out of memory error

Ihe Onwuka-2
Ok. Will let you know.

Re: regex changing \s to \s+ causes out of memory error

Ihe Onwuka-2
Seems ok.

So I've moved to 9.6 earlier than planned. I recall seeing something about which features of 3.0 are supported in HE, but now I can't find it.

Where should I be looking?

Re: regex changing \s to \s+ causes out of memory error

Michael Kay
Try this page:


