Confusion on analyze-string behavior


David Rudel
I spent nearly 2 hours investigating and trying to make a simpler repro of this and eventually gave up. Part of my confusion is that Saxon 9.6.0.5 under Oxygen 17.0 is acting differently from Saxon 9.6.0.7 under Oxygen 17.1.  But part of my confusion has nothing to do with that.

I have a csv file where each line ends in '\r\n'

I have a modified version of the csv parser from Mike's book.

This function (see attached source) gives the expected result when run using Saxon 9.6.0.5 under Oxygen 17.0:

<table>
<row Symbol="USB" Float="1762793129"/>
<row Symbol="UTX" Float="838507614"/>
.....
</table> 

I began to notice different behavior when I switched to Oxygen 17.1, which uses 9.6.0.7.  Now every line is contributing two rows, an empty one and then the expected one:

<table>
<row Symbol="" Float=""/>
<row Symbol="USB" Float="1762793129"/>
<row Symbol="" Float=""/>
<row Symbol="UTX" Float="838507614"/>
<row Symbol="" Float=""/>
<row Symbol="V" Float="2422469207"/>
<row Symbol="" Float=""/>
....
</table>

I traced this inside the function using <xsl:message> instructions:

<xsl:analyze-string select="$textStream" regex="{$separator}">
<xsl:non-matching-substring>
<xsl:message>snippet number <xsl:value-of select="position()"/>: <xsl:value-of select="."/></xsl:message>
...


Each non-matching substring gives rise to a row in the table. What I found is that in 17.1 (Saxon 9.6.0.7), every line of the csv file produces two non-matching substrings: the expected one and an additional one consisting only of &#xD;.

This leads to empty rows in the output.

There are no instances of '\r\n\r' or '\r\r\n' in the text stream, so I do not know how a &#xD; is popping out as an additional non-matching string each time.

But then I noticed (putting in the "snippet" message checker as shown above) that in both 17.0 and 17.1 the "position()" value is incrementing twice for each row. The difference is that in 17.0 the &#xD; does not show up in the <xsl:message/> output and (later) does not mess up the output file.

What made this so hard to investigate is that these extra messages only show up in the full stylesheet (attached), not in the smaller ones I tried that use the same syntax.

I'd like to know what is causing this behavior.
-David

--

"A false conclusion, once arrived at and widely accepted is not dislodged easily, and the less it is understood, the more tenaciously it is held." - Cantor's Law of Preservation of Ignorance.

_______________________________________________
saxon-help mailing list archived at http://saxon.markmail.org/
https://lists.sourceforge.net/lists/listinfo/saxon-help 

Float_Data.csv (406 bytes) Download Attachment
Analyze-String_test.xsl (4K) Download Attachment
Re: Confusion on analyze-string behavior

Michael Kay
Reproduced on 9.7.0.4.

It seems that the problem doesn't occur if you add an <xsl:matching-substring/> element.
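A minimal sketch of that workaround, assuming a hypothetical two-line input and a '\r\n' separator (the real values live in the attached stylesheet, which is not reproduced here). The variable names $textStream and $separator come from the original post; the sample data, the comma split, and the named template "main" are illustrative assumptions only.

```xslt
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <xsl:output indent="yes"/>

  <!-- Hypothetical sample data: two CSV lines, each ending in CR LF -->
  <xsl:param name="textStream"
      select="'USB,1762793129&#xD;&#xA;UTX,838507614&#xD;&#xA;'"/>
  <!-- Assumed line separator, written as a regular expression -->
  <xsl:param name="separator" select="'\r\n'"/>

  <xsl:template name="main">
    <table>
      <xsl:analyze-string select="$textStream" regex="{$separator}">
        <!-- Deliberately empty: matched separators contribute nothing -->
        <xsl:matching-substring/>
        <xsl:non-matching-substring>
          <!-- Assumes plain comma-separated fields with no quoting -->
          <xsl:variable name="fields" select="tokenize(., ',')"/>
          <row Symbol="{$fields[1]}" Float="{$fields[2]}"/>
        </xsl:non-matching-substring>
      </xsl:analyze-string>
    </table>
  </xsl:template>

</xsl:stylesheet>
```

The empty xsl:matching-substring changes nothing semantically; it only alters the shape of the instruction that the compiler sees, which is why it appears to sidestep the problem.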

Note that analyze-string with a non-matching-substring child and no matching-substring child does essentially the same as tokenize(), so your outer xsl:analyze-string could be replaced by <xsl:for-each select="tokenize(....)">
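As a rough illustration of the tokenize() alternative, here is a sketch under the same assumptions as the original post's variable names ($textStream, $separator); the sample data, the comma split, and the template name "main" are hypothetical. One difference worth guarding against: when the input ends with the separator, tokenize() returns a trailing zero-length string, whereas xsl:analyze-string never produces a zero-length non-matching substring, so the sketch filters empty tokens.

```xslt
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <xsl:output indent="yes"/>

  <!-- Hypothetical sample data: two CSV lines, each ending in CR LF -->
  <xsl:param name="textStream"
      select="'USB,1762793129&#xD;&#xA;UTX,838507614&#xD;&#xA;'"/>
  <!-- Assumed line separator, written as a regular expression -->
  <xsl:param name="separator" select="'\r\n'"/>

  <xsl:template name="main">
    <table>
      <!-- [. ne ''] drops the empty token that follows a trailing separator -->
      <xsl:for-each select="tokenize($textStream, $separator)[. ne '']">
        <!-- Assumes plain comma-separated fields with no quoting -->
        <xsl:variable name="fields" select="tokenize(., ',')"/>
        <row Symbol="{$fields[1]}" Float="{$fields[2]}"/>
      </xsl:for-each>
    </table>
  </xsl:template>

</xsl:stylesheet>
```

Inside the xsl:for-each, position() counts tokens only, so it should advance once per line rather than once per substring of either kind.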

Michael Kay
Saxonica


Re: Confusion on analyze-string behavior

Michael Kay
Actually it seems to be working correctly with --generateByteCode:off and incorrectly with --generateByteCode:on.

Michael Kay
Saxonica

Re: Confusion on analyze-string behavior

Michael Kay
I have logged this as a bug at


Please track progress there. Since bytecode generation bugs are quite tricky to diagnose, I think we're unlikely to make progress until after the Easter holiday. In the meantime I think you can work around it by switching off bytecode generation, or by using tokenize() in place of analyze-string.

Michael Kay
Saxonica



Re: Confusion on analyze-string behavior

David Rudel
Thanks, Mike. I was worried that I was going to come back to this and find out I had not taken into consideration something silly about multi-line versus single-line mode in the regex.

Thanks for the note on tokenize! I use it in the procedure already to ensure the separator exists... wonder why I didn't think to use it as the main solution. My short-term hack was to do <xsl:if test="boolean(normalize-space(.))"> to avoid the empty results.
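For reference, the short-term hack could look roughly like the sketch below. Only the xsl:if test is taken from the message above; the sample data, the '\r\n' separator, the comma split, and the template name "main" are assumptions made for illustration.

```xslt
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <xsl:output indent="yes"/>

  <!-- Hypothetical sample data: two CSV lines, each ending in CR LF -->
  <xsl:param name="textStream"
      select="'USB,1762793129&#xD;&#xA;UTX,838507614&#xD;&#xA;'"/>
  <!-- Assumed line separator, written as a regular expression -->
  <xsl:param name="separator" select="'\r\n'"/>

  <xsl:template name="main">
    <table>
      <xsl:analyze-string select="$textStream" regex="{$separator}">
        <xsl:non-matching-substring>
          <!-- Skip substrings that are empty or whitespace-only,
               such as a stray carriage return -->
          <xsl:if test="boolean(normalize-space(.))">
            <!-- Assumes plain comma-separated fields with no quoting -->
            <xsl:variable name="fields" select="tokenize(., ',')"/>
            <row Symbol="{$fields[1]}" Float="{$fields[2]}"/>
          </xsl:if>
        </xsl:non-matching-substring>
      </xsl:analyze-string>
    </table>
  </xsl:template>

</xsl:stylesheet>
```

This guard hides the symptom rather than removing the cause: the spurious substrings are still visited, so position() inside xsl:non-matching-substring would still count them.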


Re: Confusion on analyze-string behavior

O'Neil Delpratt
Just to report back that we have now resolved this bug; the fix will be available in the next maintenance release.

The fix was applied in the AnalyzeStringCompiler: the bytecode loop label had been incorrectly placed inside a conditional.

kind regards,

O’Neil
