saxon;:discard-document


David Carlisle


Possibly this is just user error, but I thought I'd comment anyway...

I had a stylesheet

<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="2.0"
xmlns:saxon="http://saxon.sf.net/"
>



<xsl:template name="main">
<xsl:for-each select="collection('/mathmldoc/c?select=*.xml;recurse=yes')/saxon:discard-document(.)">
<xsl:message>
[<xsl:value-of select="base-uri(.)"/>]
</xsl:message>
</xsl:for-each>
</xsl:template>

</xsl:stylesheet>

That worked OK on some test files but failed as below on a real set of
3000 or so documents.

$ saxon8 -p -it main /enginecvs/doc/xsl/xhtml.xsl
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space


I assumed that this meant the documents were not being discarded, and I
reshuffled the code to look like this, and it now works, so I'm happy:

<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="2.0"
xmlns:saxon="http://saxon.sf.net/"
>



<xsl:template name="main">
<xsl:for-each select="collection('/mathmldoc/c?select=*.xml;recurse=yes')">
<xsl:message>
[<xsl:value-of select="base-uri(saxon:discard-document(.))"/>]
</xsl:message>
</xsl:for-each>
</xsl:template>

</xsl:stylesheet>


But I thought it was worth asking whether the first form should have worked
(or could be made to work)?
 
David
(PS: thanks for saxon:deep-equal; I'll have to incorporate that into my
XQuery test suite harness.)

________________________________________________________________________
This e-mail has been scanned for all viruses by Star. The
service is powered by MessageLabs. For more information on a proactive
anti-virus service working around the clock, around the globe, visit:
http://www.star.net.uk
________________________________________________________________________


-------------------------------------------------------
This SF.net email is sponsored by: Splunk Inc. Do you grep through log files
for problems?  Stop!  Download the new AJAX search engine that makes
searching your log files as easy as surfing the  web.  DOWNLOAD SPLUNK!
http://ads.osdn.com/?ad_id=7637&alloc_id=16865&op=click
_______________________________________________
saxon-help mailing list
[hidden email]
https://lists.sourceforge.net/lists/listinfo/saxon-help

Re: saxon;:discard-document

Andrew Welch
> But I thought it was worth asking if the first form should have worked (or
> could be made to work) ?

I always use:

for $x in collection(concat($directory-uri,
'?select=*.xml;recurse=yes;on-error=ignore'))
                        return saxon:discard-document($x)

which seems to work fine across thousands of docs.



RE: saxon;:discard-document

Michael Kay
In reply to this post by David Carlisle
This is not a bug; just a failure to spot an opportunity for optimisation!

With the path expression collection(xyz)/discard-document(.) a sort is
necessary to get the results into document order. This explains why the "out
of memory" error occurs before any messages are output. The sort could be
avoided if Saxon knew enough about the behaviour of the two function calls,
but the knowledge is not there: in particular, discard-document() is
implemented as a pure extension function (a call to a Java method) and there
is no static knowledge of its behaviour at all.

If you rewrite the expression as

for $f in collection(xyz) return discard-document($f)

then no sort is needed, and the evaluation can therefore be pipelined.
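
Applied to the stylesheet from the original post, the rewrite might look like the sketch below. It is untested here and assumes the same collection URI and the same saxon namespace binding that appear earlier in the thread; only the select expression changes.

```xml
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="2.0"
                xmlns:saxon="http://saxon.sf.net/">

  <xsl:template name="main">
    <!-- A for expression instead of a path step: items come back in
         iteration order, so no document-order sort is requested. -->
    <xsl:for-each select="for $f in collection('/mathmldoc/c?select=*.xml;recurse=yes')
                          return saxon:discard-document($f)">
      <xsl:message>[<xsl:value-of select="base-uri(.)"/>]</xsl:message>
    </xsl:for-each>
  </xsl:template>

</xsl:stylesheet>
```

Because the for expression returns its items in iteration order, each document should be reportable and releasable before the next one is read.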

However, there is another bug in this area! If you leave out the
discard-document() entirely, then it works fine. It appears that the files
returned by the collection() function (at least with this kind of URI) are
not being held in the document map at all, so they are discarded by the
garbage collector at the first opportunity even in the absence of
discard-document(). As it happens, this is probably what most users would
want to happen, though it's not conformant with the spec which requires the
contents of the collection, as well as the individual documents within it,
to be stable for the life of the query/transformation. I'll have to think
about the best way to tackle this.

Michael Kay
http://www.saxonica.com/


> -----Original Message-----
> From: [hidden email]
> [mailto:[hidden email]] On Behalf Of
> David Carlisle
> Sent: 24 November 2005 15:05
> To: [hidden email]
> Subject: [saxon] saxon;:discard-document

Re: saxon;:discard-document

David Carlisle
> This is not a bug; just a failure to spot an opportunity for
> optimisation!

Yes, hope I didn't imply otherwise.


> then no sort is needed, and the evaluation can therefore be pipelined.
Yes, thanks. I guessed that was the problem when Andrew posted saying that
for... worked.

Somehow I find it easier to use / than for (harder to get a spelling
error in a one-character operator :-)

> However, there is another bug in this area! If you leave out the
> discard-document() entirely, then it works fine.

Oops, I didn't try that! I don't think I'll rely on that, in case you
fix that bug. I don't think using saxon:discard-document is so bad (once
you find the idioms that work), although it may end up being XSLT 2's
xx:node-set(). exslt:discard-document anyone?


David



Re: saxon;:discard-document

Colin Paul Adams
>>>>> "David" == David Carlisle <[hidden email]> writes:

    David> exslt:discard-document anyone?

I still think it breaks compliance.
--
Colin Adams
Preston Lancashire



Re: saxon;:discard-document

Andrew Welch
>     David> exslt:discard-document anyone?

That's what I was thinking.  I did wonder how it didn't make it into
the language itself, as it kind of goes hand in hand with collection.
I would have flipped it though - collection() would automatically
discard the documents unless told otherwise.

Processing whole directories of XML just with xslt 2.0 is such a great
thing, I just wish I could comprehend all the possibilities.



Re: saxon;:discard-document

Colin Paul Adams
>>>>> "Andrew" == andrew welch <[hidden email]> writes:

    David> exslt:discard-document anyone?
    Andrew> That's what I was thinking.  I did wonder how it didn't
    Andrew> make it into the language itself, as it kind of goes hand
    Andrew> in hand with collection.  I would have flipped it though -
    Andrew> collection() would automatically discard the documents
    Andrew> unless told otherwise.

But XSLT guarantees node identity - so discarding a document, and then
re-reading it later, breaks this.
So it is not possible for this to be in the language.
--
Colin Adams
Preston Lancashire



Re: saxon;:discard-document

Andrew Welch
>     David> exslt:discard-document anyone?
>     Andrew> That's what I was thinking.  I did wonder how it didn't
>     Andrew> make it into the language itself, as it kind of goes hand
>     Andrew> in hand with collection.  I would have flipped it though -
>     Andrew> collection() would automatically discard the documents
>     Andrew> unless told otherwise.
>
> But XSLT guarantees node identity - so discarding a document, and then
> re-reading it later, breaks this.
> So it is not possible for this to be in the language.

Ahh I see.  It fits perfectly as a Saxon extension then.



Re: saxon;:discard-document

David Carlisle
In reply to this post by Colin Paul Adams


> But XSLT guarantees node identity - so discarding a document, and then
> re-reading it later, breaks this.
> So it is not possible for this to be in the language.

I disagree. It would be bad if an XDM node could somehow lose its
identity and gain a new one, but that is not what discard-document
does; it just loses the assurance that if you generate an XDM tree
twice from the same URI then you generate identical nodes.

It means that collection() (and doc() etc.) are no longer pure functions,
as doc($uri) may give different results in different parts of the
stylesheet, but it's hard to come up with real cases where this is a
problem, as you only get a different result on the second call if the
result of the first has already gone out of scope, so you can't
easily compare the two.

Without this, collection($uri) and document($indexdoc) are only really
any use on large collections referenced by $uri or $indexdoc if the
system can automatically (statically) determine that no document is
referenced from two different scopes, and so each document can be
released once used. But that seems to be asking a lot. (There are other
possibilities, such as recognising that the operations on the documents
don't depend on node identity, but these don't seem any easier to
determine automatically.)


I think something like this could be in the language; it's rather like
XQuery's unordered{}: it can be specified such that a compliant
implementation may essentially ignore it, but it allows the optimiser to
shake things up rather more than usual (getting potentially different
results).

David


Re: saxon;:discard-document

David Carlisle

Or, to put it another way, a non-extension function version of
discard-document could be defined to do a deep copy (i.e. the same as
xsl:copy-of). That would have the same effect of allowing the optimiser
to throw away the original tree (as it is never referenced). It would be
different from discard-document as it would _ensure_ that you got a
different node identity each time.

<xsl:for-each select="collection(...)/deep-copy(.)">
 ....

Each time you run deep-copy(.) on a tree you get new node ids. So long
as you only ever accessed the copied tree, this would (I think) mean
that really the system wouldn't need to copy it at all and would just do
discard-document, that is, just use the existing tree without copying, but
build it again if needed again.
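
A user-level sketch of the deep-copy() idea follows, assuming a hypothetical my:deep-copy function (the my prefix and its namespace URI are invented for illustration and are not part of XSLT or Saxon). It only illustrates the proposed semantics, a fresh tree with new node identities on every call. Written this way it gives no memory saving by itself, since whether the original tree is released is up to the processor.

```xml
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="2.0"
                xmlns:my="http://example.com/my"
                exclude-result-prefixes="my">

  <!-- Hypothetical helper: returns a deep copy of the supplied document,
       built with xsl:copy-of, so the result has new node identities. -->
  <xsl:function name="my:deep-copy" as="document-node()">
    <xsl:param name="doc" as="document-node()"/>
    <xsl:copy-of select="$doc"/>
  </xsl:function>

  <xsl:template name="main">
    <xsl:for-each select="collection('/mathmldoc/c?select=*.xml;recurse=yes')">
      <!-- Only the copy is referenced from here on. -->
      <xsl:variable name="copy" select="my:deep-copy(.)"/>
      <xsl:message>[<xsl:value-of select="base-uri($copy)"/>]</xsl:message>
    </xsl:for-each>
  </xsl:template>

</xsl:stylesheet>
```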

David




Re: saxon;:discard-document

Colin Paul Adams
In reply to this post by David Carlisle
>>>>> "David" == David Carlisle <[hidden email]> writes:

    David> I disagree. It would be bad if an XDM node could somehow
    David> lose its identity and gain a new one, but that is not what
    David> discard-document does, it just loses the assurance that if
    David> you generate an XDM tree twice from the same URI then you
    David> generate identical nodes.

Which is an assurance guaranteed by the language.

    David> different parts of the stylesheet, but it's hard to come up
    David> with real cases where this is a problem, as you only get a
    David> different result on the second call if the result of the
    David> first has already gone out of scope, so you can't easily
    David> compare the two.

But you can compare the results of generate-id() on the node - this
must be the same each time.

I guess it would be possible to implement generate-id() in a way that
would preserve this property, but maybe this is not possible for all
implementations.
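
The guarantee under discussion can be stated as a small check. The sketch below assumes a hypothetical input file name supplied through a uri parameter. For a conformant processor that has not discarded anything, both tests should report true, because doc() is required to return the same node for the same URI within one transformation.

```xml
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="2.0">

  <!-- Hypothetical default; pass a real document URI when running. -->
  <xsl:param name="uri" select="'input.xml'"/>

  <xsl:template name="main">
    <xsl:variable name="first" select="doc($uri)"/>
    <xsl:variable name="second" select="doc($uri)"/>
    <!-- Node identity comparison of the two document nodes. -->
    <xsl:message>same node: <xsl:value-of select="$first is $second"/></xsl:message>
    <!-- The same comparison made through generate-id(). -->
    <xsl:message>same id: <xsl:value-of
        select="generate-id($first) eq generate-id($second)"/></xsl:message>
  </xsl:template>

</xsl:stylesheet>
```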
--
Colin Adams
Preston Lancashire



Re: saxon;:discard-document

David Carlisle

> But you can compare the results of generate-id() on the node - this
> must be the same each time.

Yes if it's the same node.

See the follow-up reply suggesting a semantics for discard-document that
it is (or could be) essentially xsl:copy-of. That is, the whole point is
that it is not the same node each time.

David


Re: saxon;:discard-document

Colin Paul Adams
In reply to this post by David Carlisle
>>>>> "David" == David Carlisle <[hidden email]> writes:

    David> or, to put it another way, a non-extension function version
    David> of discard-document could be defined to do a deep copy (ie
    David> same as xsl:copy-of) That would have the same effect of
    David> allowing the optimiser to throw away the original tree (as
    David> it is never referenced). It would be different from
    David> discard-document as it would _ensure_ that you got a
    David> different node identity each time.

    David> <xsl:for-each select="collection(...)/deep-copy(.)"> ....

    David> each time you run deep-copy(.) on a tree you get new node
    David> ids. So long as you only ever accessed the copied tree,
    David> this would (I think) mean that really the system wouldn't
    David> need to copy it at all and would just do discard-document,
    David> that is just use the existing tree without copying, but
    David> build it again if needed again,

Ingenious.
But what is the system-id of the document that is deep-copied?
It can't be the same as the copied document.
--
Colin Adams
Preston Lancashire



Re: saxon;:discard-document

David Carlisle

> But what is the system-id of the document that is deep-copied?

You mean the document-uri() accessor in the data model? It's whatever
the person (or Working Group) that's specifying the function says it is.

> It can't be the same as the copied document


Why not? If it was _exactly_ xsl:copy-of it wouldn't be the same, but
for this use I would think having it be the same would be most useful.
There is nothing new about having multiple trees with the same base URI:
just load two different trees with the same xml:base attribute. And I
don't see having two trees with the same document-uri as being any
different architecturally.


Actually, that would break the DM constraint:

  In other words, for any Document Node $arg, either fn:document-uri($arg)
  must return the empty sequence or fn:doc(fn:document-uri($arg)) must
  return $arg.

so either that constraint must change, or (possibly safer) the
document-uri property of the new document would be () and people would
have to rely on base-uri()
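
The constraint quoted above can be written as an XPath 2.0 test. This is a sketch only: the uri parameter and its default file name are hypothetical, and string() is applied to the document URI simply to pass doc() a string explicitly.

```xml
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="2.0">

  <!-- Hypothetical default; pass a real document URI when running. -->
  <xsl:param name="uri" select="'input.xml'"/>

  <xsl:template name="main">
    <xsl:variable name="d" select="doc($uri)"/>
    <!-- Either document-uri($d) is empty, or doc() of that URI must be $d itself. -->
    <xsl:message>constraint holds: <xsl:value-of select="
        if (empty(document-uri($d))) then true()
        else doc(string(document-uri($d))) is $d"/></xsl:message>
  </xsl:template>

</xsl:stylesheet>
```

Under the second option, where the copy has no document-uri, the first branch would apply and callers would fall back on base-uri().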

This isn't a fully worked out proposal; all I was suggesting was that
something like discard-document could be added to the language, even if
that needed a few small changes elsewhere, without totally breaking the
language design. It's not like saxon:assign :-) It's not even anything
like as disruptive as XQuery's unordered{..}, which makes just about every
operation non-deterministic: you can't rely on
$seq[1] returning the same node as $seq[1] even if used in the same
expression.


David


Re: saxon;:discard-document

Colin Paul Adams
>>>>> "David" == David Carlisle <[hidden email]> writes:

    >> But what is the system-id of the document that is deep-copied?

    David> You mean the document-uri() accessor in the data model ?
    David> It's whatever the person (or Working group) that's
    David> specifying the function says it is.

    >> It can't be the same as the copied document


    David> why not?

Because then the node identity is truly broken!

    David> This isn't a fully worked out proposal,

I know, so I'm giving you feedback.
--
Colin Adams
Preston Lancashire

