Running Saxon Against A Large Number of Files


Eliot Kimber-2
I need to transform approximately 1 million files in the 1–4 KB range. The transform requires loading a data file that’s used to modify the content (I’m obfuscating files by replacing text with random words pulled from an XML version of the standard Linux words.txt file). The files are organized in a directory tree four or five levels deep, and some directories may hold hundreds of thousands of files. The processing is a simple identity transform that only touches text nodes.

I haven’t tried it yet, but I’m assuming that using Saxon’s collection() extensions to process the files would probably not handle 1 million files. I could make the transform handle individual files and apply it using, e.g., the find command, but then Saxon has to reload for each file (I assume). I could also use Ant, but I’m not sure Ant could handle this number of files efficiently either. Or I could write a Java wrapper that runs Saxon and walks the directory tree.

I’m doing the processing in a macOS/Linux environment. I can use the latest Saxon version.

What is the best approach in this case?

I guess I could also just write a SAX filter, but ugh.

I want the transform to run as quickly as possible, but memory usage may be a concern: I don’t have control over the amount of memory available on the machine that will ultimately run the transform (my shiny new MacBook Pro is probably much beefier than the target servers).

Thanks,

Eliot
--
Eliot Kimber
http://contrext.com
 




------------------------------------------------------------------------------
Check out the vibrant tech community on one of the world's most
engaging tech sites, SlashDot.org! http://sdm.link/slashdot
_______________________________________________
saxon-help mailing list archived at http://saxon.markmail.org/
[hidden email]
https://lists.sourceforge.net/lists/listinfo/saxon-help 

Re: Running Saxon Against A Large Number of Files

Andrew Welch
Processing large directories of XML like this is the original reason I wrote Kernow.


Alternatively, you can easily use Saxon's collection() function along with saxon:discard-document():

<xsl:for-each select="for $x in collection('file:///c:/path/to/xml?select=*.xml;recurse=yes;on-error=ignore') return saxon:discard-document($x)">

...you'd have to add some code to recreate the same directory structure in the output as your input, but that's easy enough


On 10 February 2017 at 16:23, Eliot Kimber <[hidden email]> wrote:

Re: Running Saxon Against A Large Number of Files

Michael Kay
In reply to this post by Eliot Kimber-2
I can't see any particular reason why collection() shouldn't handle it, and it has the advantage that (on Saxon-EE at any rate) it will use multiple threads for the document parsing (equally xsl:result-document will use multiple threads for writing the result). The main thing is to use saxon:discard-document() to ensure that the documents are garbage collected after processing. I don't think any of the other approaches suggested will give you any multi-threading unless you work hard at it.
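
On Saxon-EE the parallelism can also be requested explicitly; a sketch (my reading of the EE extension documentation, not something stated above) using the saxon:threads extension attribute:

```xml
<!-- Saxon-EE only: saxon:threads asks for the for-each body to be
     evaluated in parallel across the selected documents. -->
<xsl:for-each xmlns:saxon="http://saxon.sf.net/"
              saxon:threads="8"
              select="for $x in
                        collection('file:///data?select=*.xml;recurse=yes')
                      return saxon:discard-document($x)">
  <!-- per-document processing here -->
</xsl:for-each>
```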

Michael Kay
Saxonica


Re: Running Saxon Against A Large Number of Files

Eliot Kimber-2
OK, I’ll try collection() then. I guess I should have more faith ☺

Cheers,

E.

--
Eliot Kimber
http://contrext.com
 


On 2/10/17, 12:06 PM, "Michael Kay" <[hidden email]> wrote:


Re: Running Saxon Against A Large Number of Files

Rob Koberg-2
Hi,

I saw a thread on the XSL list, but my response is more appropriate here.

When trying to transform 57315 XHTML and HTML documents, I get a "Too
many open files in system" error when just using the collection()
function without saxon:discard-document(). This template produces the
error:

<xsl:template match="/conf" mode="gather-content">
  <xsl:param name="lookup-path"/>
  <xsl:variable name="col-xhtml"
    select="collection(concat($lookup-path, '?select=*.xhtml&amp;recurse=yes'))"/>
  <xsl:variable name="col-html"
    select="collection(concat($lookup-path, '?select=*.html&amp;recurse=yes'))"/>
  <!--<xsl:message>-->
  <!--  Total XHTML: <xsl:value-of select="count($col-xhtml)"/>-->
  <!--  Total HTML: <xsl:value-of select="count($col-html)"/>-->
  <!--  Total: <xsl:value-of select="count($col-xhtml) + count($col-html)"/>-->
  <!--</xsl:message>-->
  <ignore>
    <xsl:apply-templates select="$col-xhtml/*"/>
    <xsl:apply-templates select="$col-html/*"/>
  </ignore>
</xsl:template>

The error:

(Too many open files in system)
  at xsl:apply-templates
(file:/Users/rkoberg/Sites/psc-ts/xsl/well-former-2-12.xsl#14)
     processing /conf
  in built-in template rule
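
As an aside, "Too many open files" is the operating system's per-process file-descriptor limit rather than a Saxon limit, so independent of any stylesheet change, a stopgap is to raise the soft limit in the shell that launches Saxon (a sketch; actual limit values and the hard-limit ceiling vary by system):

```shell
# Inspect the per-process open-file limits for this shell session.
ulimit -Sn   # soft limit: the one "Too many open files" enforces
ulimit -Hn   # hard limit: ceiling for the soft limit

# Raise the soft limit for subsequent commands in this session
# (must not exceed the hard limit; raising the hard limit needs root).
ulimit -n 2048
```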


This one works:

<xsl:template match="/conf" mode="gather-content">
  <xsl:param name="lookup-path"/>
  <xsl:variable name="col-xhtml"
    select="collection(concat($lookup-path, '?select=*.xhtml&amp;recurse=yes'))"/>
  <xsl:variable name="col-html"
    select="collection(concat($lookup-path, '?select=*.html&amp;recurse=yes'))"/>
  <ignore>
    <xsl:for-each select="
        for $malformed in $col-xhtml return saxon:discard-document($malformed)">
      <xsl:apply-templates/>
    </xsl:for-each>
    <xsl:for-each select="
        for $malformed in $col-html return saxon:discard-document($malformed)">
      <xsl:apply-templates/>
    </xsl:for-each>
  </ignore>
</xsl:template>


Btw, trying to use a regex in my "select" glob filter:
concat($lookup-path, '?select=*.(html|xhtml)&amp;recurse=yes')
resulted in the error:

Invalid relative URI "/Users/rkoberg/Documents/psoc-..." passed to
collection() function
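
That's expected in the sense that the select filter is a file-name glob, not a regular expression, so the (html|xhtml) alternation isn't understood. Assuming the * wildcard matches any sequence of characters (my reading of the Saxon docs, untested), a single pattern covering both extensions may work instead:

```
collection(concat($lookup-path, '?select=*html&amp;recurse=yes'))
```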

On Fri, Feb 10, 2017 at 10:13 AM, Eliot Kimber <[hidden email]> wrote:


Re: Running Saxon Against A Large Number of Files

Rob Koberg-2
Another btw: this might be a TagSoup XML parser issue? I am calling
the transform from the command line and setting TagSoup as the
parser:

java net.sf.saxon.Transform \
  -o:build/ignore_this_file_well-formed-2-12.html \
  -s:local.config.xml \
  -xsl:xsl/well-former-2-12.xsl \
  -x:org.ccil.cowan.tagsoup.Parser

On Tue, Feb 14, 2017 at 12:07 PM, Rob Koberg <[hidden email]> wrote:
