
General approach to debugging regex in xsl:analyze-string that hangs Saxon?

General approach to debugging regex in xsl:analyze-string that hangs Saxon?

Sewell, David R. (drs2n)
It's the end of the day on a Friday so I'm not going to be able to post anything
like a complete analysis of what's going on, but basically here's what I'm
trying to fix:

I have a transform that uses a fairly complex regular expression inside an
<xsl:analyze-string> element. I have used the transform a number of times on
data sets without any problem, but today the transform simply hung when it got
to a certain point in processing.

I was able to figure out that the immediate cause is the presence in the strings
being run through xsl:analyze-string of a character sequence that probably
wasn't in any previous data (and it shouldn't be there, but my schema didn't
catch that).

What I'm wondering is: is there a way with Saxon specifically to debug what is
going on when a transform hangs on something like this (Saxon-PE 9.7.0.14J)?
More generally, should an XSLT processor ever simply hang in a situation like
this? (Not meant to be a complaint, just wondering if this could indicate an
edge error condition that Saxon should maybe be handling.)

David

--
David Sewell
Manager of Digital Initiatives
The University of Virginia Press
Email: [hidden email]   Tel: +1 434 924 9973
Web: http://www.upress.virginia.edu/rotunda

------------------------------------------------------------------------------
Check out the vibrant tech community on one of the world's most
engaging tech sites, SlashDot.org! http://sdm.link/slashdot
_______________________________________________
saxon-help mailing list archived at http://saxon.markmail.org/
[hidden email]
https://lists.sourceforge.net/lists/listinfo/saxon-help 

Re: General approach to debugging regex in xsl:analyze-string that hangs Saxon?

Michael Kay
When you say it "hangs", what exactly are you observing? The term "hangs" usually means that everything is idle/waiting, and that's not at all characteristic of regex problems. Much more common with regexes is a busy state where the evaluation of the regex is using 100% of the CPU but still not getting anywhere, generally because of recursive back-tracking. Unfortunately regex syntax makes it very easy to write expressions that take exponential time to execute, and exponential means that it can take milliseconds to process 50 characters, but centuries to process 200. (Some regex engines are better than others at optimizing out the worst case scenarios, and Saxon probably isn't as good as some engines in this regard.)

My approach to debugging regexes is fairly brutal:

* stare at it for a while to see if anything hits you.

* ask yourself whether it really has to be that complicated. It doesn't, so simplify it.

* if necessary, simplify it even more than you should, i.e. take out functionality until it works on simple data, then put functionality back in incrementally to make it handle a wider range of input.

One way of simplifying is often to split validation of the input string and processing of the input string into separate operations. Another is to tokenize first, then match the tokens second.

If the issue is performance then look for ambiguities in the regex, and eliminate them if at all possible. The usual nasty ambiguity is with a construct of the form

(A)*B

where A and B can start with the same character X, so when you hit an X, you don't know whether to continue in the loop or to terminate the loop.

In my experience it's usually the length of the input data that kills you, not the actual content. If you use regexes to process more than 100 characters or so then you need to be very careful to avoid ambiguous constructs.
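
To make the (A)*B ambiguity concrete, here is a minimal sketch in Python. Python's `re` module is used only as a stand-in for a backtracking engine (an assumption; Saxon's engine may optimise differently), and the two patterns are illustrative, not taken from the original stylesheet. Both accept exactly "one or more x", but the first has A = `x+` and B = `x` starting with the same character.

```python
import re
import time

# A = x+ and B = x both start with "x", so the loop exit is ambiguous.
AMBIGUOUS = re.compile(r'^(x+)*x$')
# Same language (one or more "x"), but only one way to match it.
UNAMBIGUOUS = re.compile(r'^x+$')


def seconds_to_reject(pattern, n):
    """Time how long `pattern` takes to process n copies of "x" followed by "z"."""
    text = 'x' * n + 'z'          # the trailing "z" guarantees a failed match
    start = time.perf_counter()
    pattern.match(text)
    return time.perf_counter() - start


if __name__ == '__main__':
    # Modest lengths only: the ambiguous form roughly doubles per extra "x".
    for n in (12, 16, 20):
        print(n, seconds_to_reject(AMBIGUOUS, n), seconds_to_reject(UNAMBIGUOUS, n))
```

On input that cannot match, the ambiguous form should slow down sharply as n grows, while the unambiguous form stays roughly flat; exact timings depend on the machine and the engine.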

Michael Kay
Saxonica


Re: General approach to debugging regex in xsl:analyze-string that hangs Saxon?

Dimitre Novatchev
In reply to this post by Sewell, David R. (drs2n)

--
Cheers,
Dimitre Novatchev


Re: General approach to debugging regex in xsl:analyze-string that hangs Saxon?

Sewell, David R. (drs2n)
In reply to this post by Michael Kay
Michael,

Thanks for these pointers. In fact I spent some time over the weekend
experimenting (not sure why the message didn't post to the list until this
morning), and it's exactly as you suggest: the transform was not hanging but
instead taking an excruciatingly long time, growing exponentially with the
length of the input string. The complexity of the regular expression was
indeed a factor: this one causes problems

  (,\s+(([0-9n\.\-– ]+|and)+|[xliv\-–]+))+$

whereas this one does not

  (,\s+([0-9n\.\-– ]+|[xliv\-–]+))+$

(same thing minus looking for 'and'). The transform is used for parsing
back-of-the-book index entries to separate out the heading from the page number
sequence. I'll just have to see if I can simplify the regex without losing
matches.

David S.



Re: General approach to debugging regex in xsl:analyze-string that hangs Saxon?

Michael Kay
I would start by extracting the outer stuff into a call on tokenize (it is essentially tokenizing on ",\s+" as the delimiter).

Then you've got three nested loops: do you really need them? The problem is, 99999 can match ([0-9]+ | and)+ in 16 (that is, 2^4) different ways, and the count doubles with every extra digit. You can avoid that by writing it as ([0-9]+ (and [0-9]+)*); because the match is now unique, the captured groups also become more predictable.
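
The number of ways a run of digits can be cut into consecutive non-empty groups can be counted directly. The sketch below is in Python, and the function name is hypothetical. It counts the decompositions a backtracking engine may have to try when a repeated `[0-9]+` group is applied to a digit string; for n digits the count is 2^(n-1).

```python
from functools import lru_cache


def count_splits(digits: str) -> int:
    """Count the ways `digits` can be cut into consecutive non-empty runs,
    i.e. the distinct ways ([0-9]+)+ can match the whole string."""

    @lru_cache(maxsize=None)
    def ways(start: int) -> int:
        if start == len(digits):
            return 1                      # reached the end: one complete split
        # Choose where the next run ends, then count splits of the remainder.
        return sum(ways(end) for end in range(start + 1, len(digits) + 1))

    return ways(0)


if __name__ == '__main__':
    print(count_splits('99999'))          # five digits
    print(count_splits('9' * 20))         # twenty digits
```

Five digits give 16 splits and twenty digits give 524288, which is why a few extra characters can turn a fast failure into a very slow one.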

You've got an ambiguity in that both [0-9n\.\-– ]+ and [xliv\-–]+ can start with (or consist of) a single hyphen. Do you really need that? Can the hyphen actually occur at the start?

(Actually with \-- you're allowing hyphen in the character group twice; let's forget the "\-".)

Can you then eliminate the ambiguity at a hyphen character by writing, for example, ([xliv]+(\-[xliv]+)*)?
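
The tokenize-then-match approach described above might look roughly like the following. This is a sketch in Python rather than XSLT, and the two locator patterns are assumptions about what page references look like (they are not the poster's actual rules); `split_entry` is a hypothetical helper name.

```python
import re

# Hypothetical locator patterns (assumptions, not the poster's real rules):
# arabic pages such as "12", "3n", "15–17" or "10 and 12", and roman pages
# such as "xiv" or "iv-vi". Each character admits exactly one way forward,
# so a non-matching token is rejected in linear time.
ARABIC = re.compile(r'[0-9]+n?(?:[-–][0-9]+n?)*(?: and [0-9]+n?(?:[-–][0-9]+n?)*)*')
ROMAN = re.compile(r'[xliv]+(?:[-–][xliv]+)*')


def split_entry(entry):
    """Split an index entry into (heading, list of page locators)."""
    tokens = re.split(r',\s+', entry)        # tokenize first ...
    first_page = len(tokens)
    # ... then walk back from the end while tokens look like page locators,
    # always leaving at least one token for the heading.
    while first_page > 1 and (ARABIC.fullmatch(tokens[first_page - 1])
                              or ROMAN.fullmatch(tokens[first_page - 1])):
        first_page -= 1
    # Note: re-joining normalises the whitespace after commas in the heading.
    return ', '.join(tokens[:first_page]), tokens[first_page:]
```

With these assumed patterns, `split_entry('Adams, John, 12, 15–17, xiv')` returns `('Adams, John', ['12', '15–17', 'xiv'])`, while an entry ending in a stray character, such as `'Adams, John, 12, 15–17, xiv!'`, comes back immediately with an empty locator list instead of triggering heavy backtracking.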

Michael Kay
Saxonica



Re: General approach to debugging regex in xsl:analyze-string that hangs Saxon?

Sewell, David R. (drs2n)
Thanks, Michael, I will take a look at the suggested simplifications. The "\-–" is
actually not two hyphens but an ASCII hyphen and an en dash, but I think the
other suggestions should help.

For what it's worth, the reason I never bothered to optimize the regex is that
the transform has always run in under a second or so, no matter the length of
the string, if and only if each string to process matches the regex. Adding an
unexpected character at the end of the string triggers the extreme slowdown. A
lesson to test a complicated regex against "bad data" as well as good,
obviously.
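
A small harness for that kind of "bad data" test could look like the sketch below. It is written in Python, whose `re` module stands in for Saxon's engine (an assumption: both backtrack, but their optimisations differ), and the sample entries are made up for illustration. The regex is the problematic one quoted earlier in the thread.

```python
import re
import time

# The problematic expression from earlier in the thread, copied verbatim.
SLOW = re.compile(r'(,\s+(([0-9n\.\-– ]+|and)+|[xliv\-–]+))+$')


def timed_search(pattern, text):
    """Return (matched, seconds) for one search of `text`."""
    start = time.perf_counter()
    matched = pattern.search(text) is not None
    return matched, time.perf_counter() - start


if __name__ == '__main__':
    # Modest lengths only: the failing case grows quickly with n.
    for n in (10, 14, 18):
        good = 'Heading, ' + '1' * n      # hypothetical well-formed entry
        bad = good + '!'                  # same entry plus one stray character
        print(n, timed_search(SLOW, good), timed_search(SLOW, bad))
```

The well-formed entries should match almost instantly at every length, while the time for the entries with the stray character should climb steeply, which is the behaviour described above.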

David


Re: General approach to debugging regex in xsl:analyze-string that hangs Saxon?

Michael Kay
>
> For what it's worth, the reason I never bothered to optimize the regex is that the transform has always run in under a second or so, no matter the length of the string, if and only if each string to process matches the regex.

Yes, that's fairly typical of situations where a regex can match the same string in a zillion different ways. If it matches, the engine finds a match very quickly and exits. If it doesn't match, it tries a zillion different ways without recognizing that if one fails, they are all going to fail.

Michael Kay
Saxonica

Re: General approach to debugging regex in xsl:analyze-string that hangs Saxon?

Jirka Kosek
In reply to this post by Sewell, David R. (drs2n)
On 6.2.2017 22:10, Sewell, David R. (drs2n) wrote:
> For what it's worth, the reason I never bothered to optimize the regex
> is that the transform has always run in under a second or so, no matter
> the length of the string, if and only if each string to process matches
> the regex. Adding an unexpected character at the end of the string
> triggers the extreme slowdown. A lesson to test a complicated regex
> against "bad data" as well as good, obviously.

Sometimes you can replace regexp matching by writing a grammar for your
string data and parsing it with a parser generated by
http://www.bottlecaps.de/rex/

The advantage is that a grammar is much more readable than a regular
expression, and it will not let you introduce much ambiguity: an ambiguous
grammar will simply not compile. For bad input you also get reasonable
diagnostic messages.

It depends on the use case, but if your regexp is complex, an XSLT parser
generated by REx usually provides slightly better performance and much lower
memory consumption.

                                        Jirka

--
------------------------------------------------------------------
  Jirka Kosek      e-mail: [hidden email]      http://xmlguru.cz
------------------------------------------------------------------
     Professional XML and Web consulting and training services
DocBook/DITA customization, custom XSLT/XSL-FO document processing
------------------------------------------------------------------
 OASIS DocBook TC member, W3C Invited Expert, ISO JTC1/SC34 rep.
------------------------------------------------------------------
    Bringing you XML Prague conference    http://xmlprague.cz
------------------------------------------------------------------

