what causes the difference

Rolf Schumacher-2
I am using Saxon-HE-9.7.0-15.jar and I am about to create a
transformation in order to anonymize the input.

As a first step I was looking for all distinct words in the input and
came across a behavior that I do not comprehend.

I was not sure whether using the mode attribute on templates speeds
things up, and while testing this I got a result that puzzles me.

I boiled it down to these transformation rules:

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="2.0"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
xmlns:xs="http://www.w3.org/2001/XMLSchema" xmlns:fn="fn"
exclude-result-prefixes="xs fn">

   <xsl:output method="xml" encoding="UTF-8"/>
   <xsl:strip-space elements="*"/>

   <xsl:template match="/">
     <xsl:variable name="allwords" as="xs:string+">
       <xsl:apply-templates select="*" mode="lookup"/>
     </xsl:variable>
     <xsl:variable name="words" select="distinct-values($allwords)"/>
     <root>
       <xsl:attribute name="allwords" select="count($allwords)"/>
       <xsl:attribute name="words" select="count($words)"/>
     </root>
   </xsl:template>

   <xsl:template match="*">
       <xsl:value-of select="tokenize(text(),'[^A-Za-z0-9äöüßÄÖÜ]+')" />
       <xsl:apply-templates select="*" />
   </xsl:template>

   <xsl:template match="*" mode="lookup">
       <xsl:value-of select="tokenize(text(),'[^A-Za-z0-9äöüßÄÖÜ]+')" />
       <xsl:apply-templates select="*" />
   </xsl:template>

</xsl:stylesheet>

For a certain input (~30MB) this led to the result:

<?xml version="1.0" encoding="UTF-8"?><root allwords="696831" words="7617"/>

However, commenting the second template out, I get a different result
from the very same input:

<?xml version="1.0" encoding="UTF-8"?><root allwords="531375" words="7620"/>

To make it very clear, here are the transformation rules that produced
the second result:

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="2.0"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
xmlns:xs="http://www.w3.org/2001/XMLSchema" xmlns:fn="fn"
exclude-result-prefixes="xs fn">

   <xsl:output method="xml" encoding="UTF-8"/>
   <xsl:strip-space elements="*"/>

   <xsl:template match="/">
     <xsl:variable name="allwords" as="xs:string+">
       <xsl:apply-templates select="*" mode="lookup"/>
     </xsl:variable>
     <xsl:variable name="words" select="distinct-values($allwords)"/>
     <root>
       <xsl:attribute name="allwords" select="count($allwords)"/>
       <xsl:attribute name="words" select="count($words)"/>
     </root>
   </xsl:template>

<!--   <xsl:template match="*"> -->
<!--       <xsl:value-of select="tokenize(text(),'[^A-Za-z0-9äöüßÄÖÜ]+')" /> -->
<!--       <xsl:apply-templates select="*" /> -->
<!--   </xsl:template> -->

   <xsl:template match="*" mode="lookup">
       <xsl:value-of select="tokenize(text(),'[^A-Za-z0-9äöüßÄÖÜ]+')" />
       <xsl:apply-templates select="*" />
   </xsl:template>

</xsl:stylesheet>

Question: what is the semantic difference between the two sets of
transformation rules that could explain the difference in the result?

Kind Regards


Rolf

------------------------------------------------------------------------------
Check out the vibrant tech community on one of the world's most
engaging tech sites, Slashdot.org! http://sdm.link/slashdot
_______________________________________________
saxon-help mailing list archived at http://saxon.markmail.org/
[hidden email]
https://lists.sourceforge.net/lists/listinfo/saxon-help 

Re: what causes the difference

Rolf Schumacher-2
Sorry, the moment I sent the mail I saw the difference:
mode="lookup" is missing on the xsl:apply-templates in the last template.

---
Best Regards

Rolf Schumacher


Re: what causes the difference

Martin Honnen-2
On 07.04.2017 12:20, Rolf Schumacher wrote:

> <?xml version="1.0" encoding="UTF-8"?>
> <xsl:stylesheet version="2.0"
> xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
> xmlns:xs="http://www.w3.org/2001/XMLSchema" xmlns:fn="fn"
> exclude-result-prefixes="xs fn">
>
>    <xsl:output method="xml" encoding="UTF-8"/>
>    <xsl:strip-space elements="*"/>
>
>    <xsl:template match="/">
>      <xsl:variable name="allwords" as="xs:string+">
>        <xsl:apply-templates select="*" mode="lookup"/>
>      </xsl:variable>
>      <xsl:variable name="words" select="distinct-values($allwords)"/>
>      <root>
>        <xsl:attribute name="allwords" select="count($allwords)"/>
>        <xsl:attribute name="words" select="count($words)"/>
>      </root>
>    </xsl:template>
>
>    <xsl:template match="*">
>        <xsl:value-of select="tokenize(text(),'[^A-Za-z0-9äöüßÄÖÜ]+')" />
>        <xsl:apply-templates select="*" />
>    </xsl:template>
>
>    <xsl:template match="*" mode="lookup">
>        <xsl:value-of select="tokenize(text(),'[^A-Za-z0-9äöüßÄÖÜ]+')" />
>        <xsl:apply-templates select="*" />

I think here you want to continue to use the mode 'lookup', no? So
change that to
          <xsl:apply-templates select="*" mode="#current"/>
or
          <xsl:apply-templates select="*" mode="lookup"/>
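For reference, the reason the two runs differ: an xsl:apply-templates without a mode attribute switches to the default (unnamed) mode. In the first stylesheet the descendants are then handled by the unnamed match="*" template; once that template is commented out, the built-in template rules take over, which recurse into all child nodes and copy text nodes unchanged. A minimal sketch of the stylesheet with the mode kept throughout could look like this. It reuses the templates from the original post; dropping the unused xmlns:fn declaration is my own simplification, not part of the original.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    exclude-result-prefixes="xs">

  <xsl:output method="xml" encoding="UTF-8"/>
  <xsl:strip-space elements="*"/>

  <xsl:template match="/">
    <!-- Collect the result of the "lookup" pass over the whole document -->
    <xsl:variable name="allwords" as="xs:string+">
      <xsl:apply-templates select="*" mode="lookup"/>
    </xsl:variable>
    <xsl:variable name="words" select="distinct-values($allwords)"/>
    <root>
      <xsl:attribute name="allwords" select="count($allwords)"/>
      <xsl:attribute name="words" select="count($words)"/>
    </root>
  </xsl:template>

  <xsl:template match="*" mode="lookup">
    <xsl:value-of select="tokenize(text(),'[^A-Za-z0-9äöüßÄÖÜ]+')"/>
    <!-- mode="#current" keeps child elements in the "lookup" mode,
         so neither the unnamed-mode template nor the built-in
         rules are ever reached -->
    <xsl:apply-templates select="*" mode="#current"/>
  </xsl:template>

</xsl:stylesheet>
```

With this version the result should no longer depend on whether an unnamed-mode match="*" template is present. Note also that xsl:value-of builds a single text node per call, with the tokens joined by spaces, so count($allwords) counts text nodes rather than individual tokens; if the individual words are wanted, xsl:sequence would return the tokens themselves.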



