XsltTransformer.InputXmlResolver on .NET ignored?

Emanuel Wlaschitz
I'm not exactly sure whether this is an issue on our side, so I'm writing to the mailing list first before filing an issue report.

In general, we route as many calls as possible through our own XmlResolver so that we rarely have to leave localhost. The resolver probes a series of TR-9401 and XML catalogs, then tries some other well-known locations relative to our assemblies. Our usual pattern looks like this:

[...]
var transformer = xsltExecutable.Load();
// we try to resolve things on our own here, and fall back to the original resolver (which seems to be an XmlUrlResolver) if we can't
transformer.InputXmlResolver = new OurOwnResolver(transformer.InputXmlResolver);
transformer.SetInputStream(fileStream, fileUri);
var destination = PrepareDestination(...);
transformer.Run(destination);

...at least we thought we were.
At one point, our internet connection was a little flaky and random transformations started to fail. Looking closer, the affected input XML files looked a little like this:

<!DOCTYPE root [
   <!ENTITY % something PUBLIC "-//SOME//ENTITIES Formal Public Identifier//EN//XML" "http://some.server.tld/entities/fpi">
   %something;
]>
<root.../>

Both "-//SOME//ENTITIES Formal Public Identifier//EN//XML" and "http://some.server.tld/entities/fpi" are covered by our internal probing and should have been redirected to local copies. Instead, the parser tried to download the file from http://some.server.tld and failed because there was no working connection at that time.
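
For illustration, a wrapping resolver of the kind described above might look roughly like the following. This is a minimal sketch, not the actual OurOwnResolver from this thread: the class name LocalFirstResolver and the dictionary of known URIs are hypothetical, and the catalog probing is reduced to a plain lookup by absolute URI. It uses only System.Xml types.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Net;
using System.Xml;

// Hypothetical stand-in for OurOwnResolver: serves known URIs from local
// files and delegates everything else to the wrapped resolver.
public sealed class LocalFirstResolver : XmlResolver
{
    private readonly XmlResolver fallback;
    private readonly IDictionary<string, string> localCopies;

    public LocalFirstResolver(XmlResolver fallback, IDictionary<string, string> localCopies)
    {
        // Fall back to the framework default if no resolver was supplied.
        this.fallback = fallback ?? new XmlUrlResolver();
        this.localCopies = localCopies;
    }

    public override ICredentials Credentials
    {
        set { fallback.Credentials = value; }
    }

    public override Uri ResolveUri(Uri baseUri, string relativeUri)
    {
        // URI arithmetic is left to the wrapped resolver.
        return fallback.ResolveUri(baseUri, relativeUri);
    }

    public override object GetEntity(Uri absoluteUri, string role, Type ofObjectToReturn)
    {
        string localPath;
        if (localCopies.TryGetValue(absoluteUri.AbsoluteUri, out localPath))
        {
            // Known resource: return the local copy instead of going online.
            return File.OpenRead(localPath);
        }
        return fallback.GetEntity(absoluteUri, role, ofObjectToReturn);
    }
}
```

Such a class would be plugged in the same way as in the snippet above, passing transformer.InputXmlResolver as the fallback. Whether it is ever consulted for external entities depends on which parser ends up reading the source document.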

With some debugging, we found that even though the input resolver is set, it isn't called. The stack trace looks like this (reduced to the classes it passes through; I can add the full one if necessary):
> Our Code that calls transformer.Run
> XsltTransformer.Run
> net.sf.saxon.Controller/net.sf.saxon.event.Sender
> org.apache.xerces.jaxp.SAXParserImpl.JAXPSAXParser/org.apache.xerces.parsers.XMLParser
> org.apache.xerces.impl.XMLDocumentFragmentScannerImpl/org.apache.xerces.impl.XMLDTDScannerImpl
> org.apache.xerces.impl.XMLEntityManager
> sun.net.www.protocol.http.HttpURLConnection -> java.io.FileNotFoundException

The resolver is in fact called once, but with the stylesheet source as base URI and a relative path that is loaded using the document() function (resolved against the stylesheet path, unrelated to the source XML).
No calls are made for the source XML itself, even when I add a public or system ID.

This is Saxon-HE 9.6.0.6 running on .NET 4.5. I'm pretty sure this used to work; otherwise we would have chosen a different approach, for example XsltTransformer.InitialContextNode instead of XsltTransformer.SetInputStream. Then again, that part of the code has been around since 9.4 running on .NET 3.5 (or even earlier).
The same happens with Saxon-HE 9.7.0.7 (which seems to be the most recent version available on SourceForge).

Saxon issue, or am I missing something?

Thanks for reading.
Regards, Emanuel
_______________________________________________
saxon-help mailing list archived at http://saxon.markmail.org/
[hidden email]
https://lists.sourceforge.net/lists/listinfo/saxon-help 

Re: XsltTransformer.InputXmlResolver on .NET ignored?

Michael Kay
The documentation for InputXmlResolver says that it's used for the URIs supplied to the doc() and document() functions. I don't think that's actually a complete list of the places it is used.

Looking at the code, I think that when we use the Microsoft parser to parse source documents, we set the XmlResolver on that parser, which means it should get used to resolve references to external entities such as DTDs. But I think that when we use the Apache parser (which is the default), we don't: that reflects the Java situation where the resolvers used by the parser and the XSLT processor have quite different interfaces.

The easiest workaround is probably to build the source document yourself using a DocumentBuilder - this is documented (apparently correctly) to use its XmlResolver for resolving external entity references, whichever parser is used. You can then supply the resulting XdmNode to XsltTransformer.InitialContextNode.
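
A minimal sketch of this workaround, assuming the Saxon 9.x .NET API (namespace Saxon.Api) with DocumentBuilder.XmlResolver, DocumentBuilder.BaseUri and DocumentBuilder.Build(Stream) as described in its documentation. The helper name and its parameters are hypothetical, and the code has not been run against 9.6 or 9.7.

```csharp
using System;
using System.IO;
using System.Xml;
using Saxon.Api;

public static class DocumentBuilderWorkaround
{
    // Parses the source with a DocumentBuilder (so the resolver is consulted
    // for external entities) and hands the finished tree to the transformer.
    public static void Run(Processor processor, XsltExecutable executable,
                           string inputPath, XmlResolver resolver,
                           XmlDestination destination)
    {
        DocumentBuilder builder = processor.NewDocumentBuilder();
        builder.XmlResolver = resolver;
        // Build(Stream) needs a base URI, because a stream carries none.
        builder.BaseUri = new Uri(Path.GetFullPath(inputPath));

        XdmNode source;
        using (FileStream input = File.OpenRead(inputPath))
        {
            source = builder.Build(input);
        }

        XsltTransformer transformer = executable.Load();
        // Still used for doc() and document() calls in the stylesheet.
        transformer.InputXmlResolver = resolver;
        transformer.InitialContextNode = source;
        transformer.Run(destination);
    }
}
```

Compared with SetInputStream, this builds the complete tree before the transformation starts, so any whitespace stripping requested by the stylesheet is applied to an existing tree rather than during parsing.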

Michael Kay
Saxonica

> On 11 Aug 2016, at 13:44, Emanuel Wlaschitz <[hidden email]> wrote:
> [...]

Re: XsltTransformer.InputXmlResolver on .NET ignored?

Emanuel Wlaschitz
We're doing this in other places already (mostly for in-memory transformations), but replaced it in a few cases where we had files available. Not sure if you remember, but we had some performance issues using InitialContextNode before (where O'Neil investigated, on 9.4 I believe?), so we switched to SetInputStream for a huge performance improvement.
I'll see if I can find the old samples and re-run them on 9.6 and 9.7 to find out whether this is still true. Back then it made a difference of hours on larger inputs (>100MB); if it is negligible now, I suppose we can switch back again.

Thanks for the info!

Regards, Emanuel

-----Original Message-----
From: Michael Kay [mailto:[hidden email]]
Sent: Thursday, August 11, 2016 15:56
To: Mailing list for the SAXON XSLT and XQuery processor <[hidden email]>
Subject: Re: [saxon] XsltTransformer.InputXmlResolver on .NET ignored?

[...]

Re: XsltTransformer.InputXmlResolver on .NET ignored?

Michael Kay
OK, thanks, I had forgotten that bit of history. I've just found a 2013 thread where we identified that doing whitespace stripping on the fly was much faster than stripping it dynamically from an existing tree, and that could be the issue you're talking about. If you really need to have whitespace-stripping rules more complex than those offered by the DocumentBuilder (all, ignorable, or none) then I see this could be an issue with this workaround.
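
For reference, selecting one of those built-in policies might look like the sketch below. It assumes that the Saxon 9.x .NET API exposes a DocumentBuilder.WhitespacePolicy property and a WhitespacePolicy type with the members PreserveAll, StripIgnorable and StripAll; the helper class is hypothetical. This is coarser than xsl:strip-space, which can select elements by name.

```csharp
using Saxon.Api;

public static class WhitespaceOptions
{
    // Returns a builder that discards all whitespace-only text nodes
    // while the tree is being built.
    public static DocumentBuilder NewStrippingBuilder(Processor processor)
    {
        DocumentBuilder builder = processor.NewDocumentBuilder();
        builder.WhitespacePolicy = WhitespacePolicy.StripAll;
        return builder;
    }
}
```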

Any other workarounds I can think of involve diving down into rather low-level (Java-oriented) APIs. For example you could try

net.sf.saxon.om.NodeInfo doc =
   transformer.Implementation.prepareInputTree(
     new org.xml.sax.stream.StreamSource(
       new net.sf.saxon.dotnet.DotNetInputStream(stream), baseUri));
transformer.InitialContextNode = (XdmNode)XdmValue.wrap(doc);

The other thing we need to think about is whether to treat this as a bug. If it's a change between 9.6 and 9.7 then we should definitely do so. If the current behaviour has been in the product for some time, then given that it's consistent with the documentation, it would be safer not to change it.

Michael Kay
Saxonica


> On 11 Aug 2016, at 15:23, Emanuel Wlaschitz <[hidden email]> wrote:
> [...]

Re: XsltTransformer.InputXmlResolver on .NET ignored?

Emanuel Wlaschitz
Your workaround doesn't compile for me.
org.xml.sax has no stream namespace, and the only suggestion Visual Studio offers is javax.xml.transform.stream.StreamSource (which causes the transformation to fail with "java.lang.IllegalArgumentException: A source of class class javax.xml.transform.stream.StreamSource is not recognized by any registered object model").
I have both saxon9he and saxon9he-api referenced, as well as all the IKVM assemblies (for good measure).

I also dug out the old code I wrote back then and ran it against several Saxon-HE versions. The stylesheet is the same as back then, so it still includes xsl:strip-space. Every strategy was run multiple times and the times were averaged:

Saxon-HE 9.4.0.6N
SetInputStream: avg 00:00:16.9036687
InitialContextNode: avg 00:02:11.1667005
(so, ~17 seconds vs. >2 minutes)

Saxon-HE 9.5.1.4N
SetInputStream: avg 00:00:17.9912143
InitialContextNode: avg 00:02:03.4323418
(again, ~18 seconds vs. >2 minutes; but slightly better)

Saxon-HE 9.6.0.6N
SetInputStream: avg 00:00:17.3243900
InitialContextNode: avg 00:01:51.8031111
(getting better, ~17 seconds vs. a little under 2 minutes)

Saxon-HE 9.7.0.7N
SetInputStream: avg 00:00:18.1995258
InitialContextNode: avg 00:02:53.9077358
(getting slower again? ~18 seconds vs. almost 3 minutes...I actually ran this one twice just to check if something threw off the results)

All tests ran on .NET 4.5.2 as a console application (Any CPU, Prefer32Bit) with the same base code, compiled with Visual Studio 2015 Update 3; the only thing that changed between runs was the referenced libraries (followed by a clean/rebuild). Disabling Prefer32Bit (so that it runs as a 64-bit application) actually makes things worse here (SetInputStream avg 21 seconds vs. InitialContextNode avg 3 minutes 47 seconds).
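
The averaging itself can be sketched as below. This is not the original test code, only a minimal illustration with a hypothetical Timing.Average helper; each strategy (one complete transformation using a given input method) would be passed in as a delegate.

```csharp
using System;
using System.Diagnostics;
using System.Linq;

public static class Timing
{
    // Runs the given strategy 'runs' times and returns the mean wall-clock time.
    public static TimeSpan Average(Action strategy, int runs)
    {
        if (runs <= 0)
        {
            throw new ArgumentOutOfRangeException("runs");
        }

        var samples = new long[runs];
        for (int i = 0; i < runs; i++)
        {
            Stopwatch watch = Stopwatch.StartNew();
            strategy();
            watch.Stop();
            samples[i] = watch.Elapsed.Ticks;
        }
        return TimeSpan.FromTicks((long)samples.Average());
    }
}
```

A call such as Timing.Average(() => RunWithInputStream(path), 5) would yield one figure per strategy, where RunWithInputStream is a placeholder for the code under test.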
Parts of the stylesheet still rely on xsl:strip-space doing its work (in some cases it is tricky to get rid of, because of pretty-printing in the source and other insignificant whitespace that we can't do anything about), and we cannot be certain whether other stylesheets (beyond our control) rely on it.
And after all, performance is key because nobody likes to wait. As for whether it should be considered a bug: I don't think so, mostly because it has always worked like that, and still does, with the caveat of being slow.

Although, I think we might be able to do something else. We already do some pre-processing of the input XML and could add another step that simply drops the DOCTYPE declaration. By that point we know that the input is (well, should be...) both well-formed and valid against the XSD/DTD, and that character entities have been replaced by numeric character references.
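
Such a step could be sketched with the framework's own XmlReader and XmlWriter, as below. The class and method names are hypothetical and this is not part of the existing pre-processing; it assumes, as stated above, that no named entity references remain in the document body.

```csharp
using System.IO;
using System.Xml;

public static class DoctypeStripper
{
    // Copies an XML document from input to output without its DOCTYPE
    // declaration. The DTD is never loaded, so nothing is fetched remotely.
    public static void CopyWithoutDoctype(Stream input, Stream output)
    {
        var readerSettings = new XmlReaderSettings
        {
            DtdProcessing = DtdProcessing.Ignore, // skip the DOCTYPE entirely
            XmlResolver = null                    // no external resolution at all
        };

        using (XmlReader reader = XmlReader.Create(input, readerSettings))
        using (XmlWriter writer = XmlWriter.Create(output))
        {
            // With the reader in its initial state, this copies the whole document.
            writer.WriteNode(reader, false);
        }
    }
}
```

One caveat: anything the DTD would have contributed, such as default attribute values or entity declarations, is lost as well, so this only suits input that no longer depends on it.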

Regards, Emanuel

-----Original Message-----
From: Michael Kay [mailto:[hidden email]]
Sent: Thursday, August 11, 2016 19:10
To: Mailing list for the SAXON XSLT and XQuery processor <[hidden email]>
Subject: Re: [saxon] XsltTransformer.InputXmlResolver on .NET ignored?

[...]
>> No calls are being made for the XML itself, even when I add a Public or System ID.
>>
>> This is using Saxon-HE 9.6.0.6 running on .NET 4.5; and I'm pretty sure this should have been working already (otherwise we would've chosen a different approach there, for example using XsltTransformer.InitialContextNode instead of XsltTransformer.SetInputStream if this wasn't working before)...then again, that part of the code has been around since 9.4 running on .NET 3.5 (or even earlier).
>> And apparently, this still happens with Saxon-HE 9.7.0.7 (which seems to be the most recent one available on SourceForge).
>>
>> Saxon issue, or am I missing something?
>>
>> Thanks for reading.
>> Regards, Emanuel
>> ---------------------------------------------------------------------
>> -
>> -------- What NetFlow Analyzer can do for you? Monitors network
>> bandwidth and traffic patterns at an interface-level. Reveals which
>> users, apps, and protocols are consuming the most bandwidth. Provides
>> multi-vendor support for NetFlow, J-Flow, sFlow and other flows. Make
>> informed decisions using capacity planning reports.
>> http://sdm.link/zohodev2dev
>> _______________________________________________
>> saxon-help mailing list archived at http://saxon.markmail.org/ 
>> [hidden email]
>> https://lists.sourceforge.net/lists/listinfo/saxon-help
>
>
>
> ----------------------------------------------------------------------
> -------- What NetFlow Analyzer can do for you? Monitors network
> bandwidth and traffic patterns at an interface-level. Reveals which
> users, apps, and protocols are consuming the most bandwidth. Provides
> multi-vendor support for NetFlow, J-Flow, sFlow and other flows. Make
> informed decisions using capacity planning reports.
> http://sdm.link/zohodev2dev 
> _______________________________________________
> saxon-help mailing list archived at http://saxon.markmail.org/ 
> [hidden email]
> https://lists.sourceforge.net/lists/listinfo/saxon-help
> ----------------------------------------------------------------------
> -------- What NetFlow Analyzer can do for you? Monitors network
> bandwidth and traffic patterns at an interface-level. Reveals which
> users, apps, and protocols are consuming the most bandwidth. Provides
> multi-vendor support for NetFlow, J-Flow, sFlow and other flows. Make
> informed decisions using capacity planning reports.
> http://sdm.link/zohodev2dev 
> _______________________________________________
> saxon-help mailing list archived at http://saxon.markmail.org/ 
> [hidden email]
> https://lists.sourceforge.net/lists/listinfo/saxon-help




Re: XsltTransformer.InputXmlResolver on .NET ignored?

Michael Kay
OK, sorry, it was only an idea, not tested working code.

StreamSource is actually javax.xml.transform.stream.StreamSource.

But I missed the documentation of Controller.prepareInputTree, which says

@param source the input tree. Must be either a DOMSource or a NodeInfo

Incidentally we have an open bug that might be related: https://saxonica.plan.io/issues/2729

It would be useful if you could let us have a repro for the runs below so we can investigate. Dynamic space stripping on a DOM is always going to be slower than space stripping during parsing, but I can't see why it should have got worse in 9.7. I've been thinking for a while that we should offer the option to do in-situ space stripping on the DOM - that is, we actually modify the supplied input DOM - which is what we do in Saxon-JS.
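
The in-situ idea can be sketched at the plain JDK level. The following is a minimal illustration, not Saxon's implementation: it removes whitespace-only text nodes from a W3C DOM in place and keeps them when the parent element is in a preserve set, roughly mirroring xsl:strip-space elements="*" combined with xsl:preserve-space. The names InSituStrip and strip are hypothetical, and xml:space attributes and CDATA sections are ignored for brevity.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Set;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical helper: strips whitespace-only text nodes by modifying the DOM itself.
class InSituStrip {

    // Whitespace-only text children of 'parent' are removed unless the parent's
    // name is in 'preserve'. Child elements are always visited, so only the
    // direct parent decides whether a whitespace node survives.
    static void strip(Node parent, Set<String> preserve) {
        boolean keepWhitespace = parent.getNodeType() == Node.ELEMENT_NODE
                && preserve.contains(parent.getNodeName());
        List<Node> toRemove = new ArrayList<>();
        for (Node c = parent.getFirstChild(); c != null; c = c.getNextSibling()) {
            if (c.getNodeType() == Node.TEXT_NODE) {
                if (!keepWhitespace && c.getNodeValue().trim().isEmpty()) {
                    toRemove.add(c);
                }
            } else if (c.getNodeType() == Node.ELEMENT_NODE) {
                strip(c, preserve);
            }
        }
        // Remove after the loop so sibling traversal is not disturbed.
        for (Node n : toRemove) {
            parent.removeChild(n);
        }
    }

    public static void main(String[] args) throws Exception {
        String xml = "<doc>\n  <sec>\n    <para>a <b>b</b> <i>c</i></para>\n  </sec>\n</doc>";
        Document dom = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        strip(dom.getDocumentElement(), Collections.singleton("para"));
        Node sec = dom.getDocumentElement().getFirstChild();
        System.out.println("children of doc: " + dom.getDocumentElement().getChildNodes().getLength());
        System.out.println("children of para: " + sec.getFirstChild().getChildNodes().getLength());
    }
}
```

The trade-off is that the caller's DOM is changed permanently, so this only fits when nothing else needs the unstripped tree.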

I think we should try and look for a solution that enables you to use the Apache parser with both on-the-fly whitespace stripping and a user-supplied resolver for external entities.
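
For reference, the parser-level hook for external entities in Java is org.xml.sax.EntityResolver. Below is a minimal sketch of redirecting a public or system identifier to a local copy. It uses only JDK classes; LocalCopyResolver, register and parseText are hypothetical names, in-memory strings stand in for local files, and nothing here shows how (or whether) such a resolver can be handed to the parser through Saxon.Api.

```java
import java.io.StringReader;
import java.util.HashMap;
import java.util.Map;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.EntityResolver;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical resolver: maps public or system identifiers to local content.
class LocalCopyResolver implements EntityResolver {
    private final Map<String, String> localCopies = new HashMap<>();
    int hits = 0; // number of lookups answered from the local map

    void register(String publicOrSystemId, String content) {
        localCopies.put(publicOrSystemId, content);
    }

    @Override
    public InputSource resolveEntity(String publicId, String systemId) {
        String content = publicId != null ? localCopies.get(publicId) : null;
        if (content == null) {
            content = localCopies.get(systemId);
        }
        if (content == null) {
            return null; // unknown: the parser opens the system identifier itself
        }
        hits++;
        InputSource source = new InputSource(new StringReader(content));
        source.setPublicId(publicId);
        source.setSystemId(systemId);
        return source;
    }

    // Parses 'xml' with the given resolver installed and returns its text content.
    static String parseText(String xml, EntityResolver resolver) throws Exception {
        final StringBuilder text = new StringBuilder();
        XMLReader reader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
        reader.setEntityResolver(resolver);
        reader.setContentHandler(new DefaultHandler() {
            @Override
            public void characters(char[] ch, int start, int length) {
                text.append(ch, start, length);
            }
        });
        reader.parse(new InputSource(new StringReader(xml)));
        return text.toString();
    }

    public static void main(String[] args) throws Exception {
        LocalCopyResolver resolver = new LocalCopyResolver();
        // Assumed content of the local copy, for illustration only.
        resolver.register("-//SOME//ENTITIES Formal Public Identifier//EN//XML",
                "<!ENTITY hello \"resolved locally\">");
        String xml = "<!DOCTYPE root [\n"
                + "<!ENTITY % something PUBLIC \"-//SOME//ENTITIES Formal Public Identifier//EN//XML\""
                + " \"http://some.server.tld/entities/fpi\">\n"
                + "%something;\n"
                + "]>\n"
                + "<root>&hello;</root>";
        System.out.println(parseText(xml, resolver));
    }
}
```

Because the resolver returns a ready InputSource for the registered identifier, the parser has no need to open the http URL for it; identifiers that are not registered still fall through to the network.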

Michael Kay
Saxonica

> On 12 Aug 2016, at 07:53, Emanuel Wlaschitz <[hidden email]> wrote:
>
> Your workaround doesn't compile for me.
> org.xml.sax has no stream Namespace, and the only suggestion Visual Studio brings up is javax.xml.transform.stream.StreamSource (which causes the transformation to fail with "java.lang.IllegalArgumentException: A source of class class javax.xml.transform.stream.StreamSource is not recognized by any registered object model").
> I got both saxon9he and saxon9he-api referenced, as well as all the IKVM assemblies (for good measure).
>
> I also dug out the old code I wrote back then, and threw some Saxon-HE versions at it. The results look like this (same stylesheet as back then, so it still includes xsl:strip-space), running every strategy multiple times and taking the average time:
>
> Saxon-HE 9.4.0.6N
> SetOutputStream: avg 00:00:16.9036687
> InitialContextNode: avg 00:02:11.1667005
> (so, ~17 seconds vs. >2 minutes)
>
> Saxon-HE 9.5.1.4N
> SetOutputStream: avg 00:00:17.9912143
> InitialContextNode: avg 00:02:03.4323418
> (again, ~18 seconds vs. >2 minutes; but slightly better)
>
> Saxon-HE 9.6.0.6N
> SetOutputStream: avg 00:00:17.3243900
> InitialContextNode: avg 00:01:51.8031111
> (getting better, ~17 seconds vs. a little under 2 minutes)
>
> Saxon-HE 9.7.0.7N
> SetOutputStream: avg 00:00:18.1995258
> InitialContextNode: avg 00:02:53.9077358
> (getting slower again? ~18 seconds vs. almost 3 minutes...I actually ran this one twice just to check if something threw off the results)
>
> All tests ran on .NET 4.5.2 as console application (Any CPU, Prefer32Bit), same base code, compiled with Visual Studio 2015 Update 3, the only thing changed in between was the referenced libraries (followed by a clean/rebuild). Disabling Prefer32Bit (so that it runs as 64-bit application) actually makes things worse here (SetOutputStream avg 21 seconds vs. InitialContextNode avg 3 minutes 47 seconds).
> Parts of the stylesheet still rely on xsl:strip-space doing its work (and in some cases, it seems to be tricky to get rid of due to pretty-printing in the source and other insignificant whitespace that we can't do anything about), and we cannot be certain that other stylesheets (beyond our control) use or don't use that.
> And after all, performance is key because nobody likes to wait. Whether it should be considered a bug...I don't think so, mostly for the fact that it always worked like that, and still does; with the caveat of being slow.
>
> Although, I think we might be able to do something else...we do some pre-processing of the input XML already and could add another step that simply drops the DOCTYPE declaration, since we know by that time that the input is (well, should be...) both well-formed and valid against the xsd/dtd, plus has character entities resolved into numeric entities.
>
> Regards, Emanuel




Re: XsltTransformer.InputXmlResolver on .NET ignored?

Emanuel Wlaschitz
I suppose the main reason you don't see this very often is that we have an apparently unique combination: fairly large XML files (a few MB upwards, averaging around 5-20MB and peaking at over 200MB) and fairly complex XSLTs (in this particular case, about 30 modules pulled in via xsl:import, roughly 600KB on disk) that include <xsl:strip-space elements="*"/> together with <xsl:preserve-space elements="para"/> (para being the common paragraph element for those types of input, and the only mixed-content element we have to care about).

I can't share the test case XML and XSLT with you, but you can use the DocBook samples from back then again.
Using those, I get negligible differences with the original files, but the InitialContextNode times go up as I start duplicating the <part> elements to bring the input file size up a little (doing it twice gives about 3 seconds difference on my machine); and up again as I add <xsl:strip-space elements="*"/> in the docbook.xsl file (about a second or two more than just duplicating the <part> elements).
The difference between 9.6.0.6 and 9.7.0.7 is about a second without strip-space and about two seconds with strip-space (9.6 being faster).

Ultimately, the best case would be to get the same resolving behaviour for SetInputStream directly through the existing Saxon.Api, without having to reach into the implementation or Java land.
...and that would certainly be a feature request, since the documentation was rather clear on when the Resolver would be called (and you even added the Note for InitialContextNode and xsl:strip-space being inefficient, awesome!)

Thanks again,
Emanuel

-----Original Message-----
From: Michael Kay [mailto:[hidden email]]
Sent: Friday, August 12, 2016 10:17
To: Mailing list for the SAXON XSLT and XQuery processor <[hidden email]>
Subject: Re: [saxon] XsltTransformer.InputXmlResolver on .NET ignored?

OK, sorry, it was only an idea, not tested working code.

StreamSource is actually javax.xml.transform.stream.StreamSource.

But I missed the documentation of Controller.prepareInputTree, which says

@param source the input tree. Must be either a DOMSource or a NodeInfo

Incidentally we have an open bug that might be related: https://saxonica.plan.io/issues/2729

It would be useful if you could let us have a repro for the runs below so we can investigate. Dynamic space stripping on a DOM is always going to be slower than space stripping during parsing, but I can't see why it should have got worse in 9.7. I've been thinking for a while that we should offer the option to do in-situ space stripping on the DOM - that is, we actually modify the supplied input DOM - which is what we do in Saxon-JS.

I think we should try and look for a solution that enables you to use the Apache parser with both on-the-fly whitespace stripping and a user-supplied resolver for external entities.

Michael Kay
Saxonica

> On 12 Aug 2016, at 07:53, Emanuel Wlaschitz <[hidden email]> wrote:
>
> Your workaround doesn't compile for me.
> org.xml.sax has no stream Namespace, and the only suggestion Visual Studio brings up is javax.xml.transform.stream.StreamSource (which causes the transformation to fail with "java.lang.IllegalArgumentException: A source of class class javax.xml.transform.stream.StreamSource is not recognized by any registered object model").
> I got both saxon9he and saxon9he-api referenced, as well as all the IKVM assemblies (for good measure).
>
> I also dug out the old code I wrote back then, and threw some Saxon-HE versions at it. The results look like this (same stylesheet as back then, so it still includes xsl:strip-space), running every strategy multiple times and taking the average time:
>
> Saxon-HE 9.4.0.6N
> SetOutputStream: avg 00:00:16.9036687
> InitialContextNode: avg 00:02:11.1667005 (so, ~17 seconds vs. >2
> minutes)
>
> Saxon-HE 9.5.1.4N
> SetOutputStream: avg 00:00:17.9912143
> InitialContextNode: avg 00:02:03.4323418 (again, ~18 seconds vs. >2
> minutes; but slightly better)
>
> Saxon-HE 9.6.0.6N
> SetOutputStream: avg 00:00:17.3243900
> InitialContextNode: avg 00:01:51.8031111 (getting better, ~17 seconds
> vs. a little under 2 minutes)
>
> Saxon-HE 9.7.0.7N
> SetOutputStream: avg 00:00:18.1995258
> InitialContextNode: avg 00:02:53.9077358 (getting slower again? ~18
> seconds vs. almost 3 minutes...I actually ran this one twice just to
> check if something threw off the results)
>
> All tests ran on .NET 4.5.2 as console application (Any CPU, Prefer32Bit), same base code, compiled with Visual Studio 2015 Update 3, the only thing changed in between was the referenced libraries (followed by a clean/rebuild). Disabling Prefer32Bit (so that it runs as 64-bit application) actually makes things worse here (SetOutputStream avg 21 seconds vs. InitialContextNode avg 3 minutes 47 seconds).
> Parts of the stylesheet still rely on xsl:strip-space doing its work (and in some cases, it seems to be tricky to get rid of due to pretty-printing in the source and other insignificant whitespace that we can't do anything about), and we cannot be certain that other stylesheets (beyond our control) use or don't use that.
> And after all, performance is key because nobody likes to wait. Whether it should be considered a bug...I don't think so, mostly for the fact that it always worked like that, and still does; with the caveat of being slow.
>
> Although, I think we might be able to do something else...we do some pre-processing of the input XML already and could add another step that simply drops the DOCTYPE declaration, since we know by that time that the input is (well, should be...) both well-formed and valid against the xsd/dtd, plus has character entities resolved into numeric entities.
>
> Regards, Emanuel
>
> -----Original Message-----
> From: Michael Kay [mailto:[hidden email]]
> Sent: Thursday, August 11, 2016 19:10
> To: Mailing list for the SAXON XSLT and XQuery processor
> <[hidden email]>
> Subject: Re: [saxon] XsltTransformer.InputXmlResolver on .NET ignored?
>
> OK, thanks, I had forgotten that bit of history. I've just found a 2013 thread where we identified that doing whitespace stripping on the fly was much faster than stripping it dynamically from an existing tree, and that could be the issue you're talking about. If you really need to have whitespace-stripping rules more complex than those offered by the DocumentBuilder (all, ignorable, or none) then I see this could be an issue with this workaround.
>
> Any other workarounds I can think of involve diving down into rather
> low-level (Java-oriented) APIs. For example you could try
>
> net.sf.saxon.om.NodeInfo doc =
>   transformer.Implementation.prepareInputTree(
>     new org.xml.sax.stream.StreamSource(
>       new net.sf.saxon.dotnet.DotNetInputStream(stream), baseUri));
> transformer.InitialContextNode = (XdmNode)XdmValue.wrap(doc);
>
> The other thing we need to think about is whether to treat this as a bug. If it's a change between 9.6 and 9.7 then we should definitely do so. If the current behaviour has been in the product for some time, then given that it's consistent with the documentation, it would be safer not to change it.
>
> Michael Kay
> Saxonica
>
>
>> On 11 Aug 2016, at 15:23, Emanuel Wlaschitz <[hidden email]> wrote:
>>
>> We're doing this in other places already (mostly for in-memory transformations), but replaced it for a few cases where we had files available. Not sure if you remember, but we had some performance issues using InitialContextNode before (where O'Neil investigated, on 9.4 I believe?) so we switched to SetInputStream for a huge performance improvement.
>> I'll see if I can find the old samples to re-run them on 9.6 and 9.7 to find out whether this is still true or not. It did make a difference by hours on larger inputs (>100MB) back then; if it is negligible now I suppose we can switch back again.
>>
>> Thanks for the info!
>>
>> Regards, Emanuel
>>
>> -----Original Message-----
>> From: Michael Kay [mailto:[hidden email]]
>> Sent: Thursday, August 11, 2016 15:56
>> To: Mailing list for the SAXON XSLT and XQuery processor
>> <[hidden email]>
>> Subject: Re: [saxon] XsltTransformer.InputXmlResolver on .NET ignored?
>>
>> The documentation for InputXmlResolver says that it's used for the URIs supplied to the doc() and document() functions. I don't think that's actually a complete list of the places it is used.
>>
>> Looking at the code, I think that when we use the Microsoft parser to parse source documents, we set the XmlResolver on that parser, which means it should get used to resolve references to external entities such as DTDs. But I think that when we use the Apache parser (which is the default), we don't: that reflects the Java situation where the resolvers used by the parser and the XSLT processor have quite different interfaces.
>>
>> The easiest workaround is probably to build the source document yourself using a DocumentBuilder - this is documented (apparently correctly) to use its XmlResolver for resolving external entity references, whichever parser is used. You can then supply the resulting XdmNode to XsltTransformer.InitialContextNode.
>>
>> Michael Kay
>> Saxonica
>>
>>> On 11 Aug 2016, at 13:44, Emanuel Wlaschitz <[hidden email]> wrote:
>>>
>>> I'm not exactly sure if this is an issue on our side, so I'm writing to the ML first before creating an issue report.
>>>
>>> In general, we try to redirect as many calls as possible through our own XmlResolver so we can avoid leaving localhost as much as possible (by probing a series of TR-9401 and XML catalogs, then trying some other well-known locations relative to our assemblies), so our usual pattern looks like this:
>>>
>>> [...]
>>> var transformer = xsltExecutable.Load(); // we try to resolve things
>>> on our own here, and fall back to the original resolver (which seems
>>> to be an XmlUrlResolver) if we can't transformer.InputXmlResolver =
>>> new OurOwnResolver(transformer.InputXmlResolver);
>>> transformer.SetInputStream(fileStream, fileUri); var destination =
>>> PrepareDestination(...); transformer.Run(destination);
>>>
>>> ...at least we thought we were.
>>> At one point, our internet connection was a little flaky and random transformations started to fail. Looking closer, the affected input XML files looked a little like this:
>>>
>>> <!DOCTYPE root [
>>> <!ENTITY % something PUBLIC "-//SOME//ENTITIES Formal Public
>>> Identifier//EN//XML" "http://some.server.tld/entities/fpi">
>>> %something;
>>> ]>
>>> <root.../>
>>>
>>> Both "-//SOME//ENTITIES Formal Public Identifier//EN//XML" and "http://some.server.tld/entities/fpi" were part of our internal probing and should have been redirected to local copies, but in fact it was trying to download the file from http://some.server.tld and failed for the lack of a working connection at that time.
>>>
>>> With some debugging, we found out that even though the input resolver is set, it isn't being called. The StackTrace looks like this (dumbed down to the classes it goes thru, I can add the full one if necessary):
>>>> Our Code that calls transformer.Run
>>>> XsltTransformer.Run
>>>> net.sf.saxon.Controller/net.sf.saxon.event.Sender
>>>> org.apache.xerces.jaxp.SAXParserImpl.JAXPSAXParser/org.apache.xerces.parsers.XMLParser
>>>> org.apache.xerces.impl.XMLDocumentFragmentScannerImpl/org.apache.xerces.impl.XMLDTDScannerImpl
>>>> org.apache.xerces.impl.XMLEntityManager
>>>> sun.net.www.protocol.http.HttpURLConnection -> java.io.FileNotFoundException
>>>
>>> The resolver is in fact called once, but only for a relative path loaded via the document() function, with the stylesheet source as the base URI (so it is based on the stylesheet path and not related to the source XML at all).
>>> No calls are made for the source XML itself, even when I add a public or system ID.
>>>
>>> This is Saxon-HE 9.6.0.6 running on .NET 4.5. I'm pretty sure this used to work; otherwise we would have chosen a different approach, for example XsltTransformer.InitialContextNode instead of XsltTransformer.SetInputStream. Then again, that part of the code has been around since 9.4 running on .NET 3.5 (or even earlier).
>>> And apparently, this still happens with Saxon-HE 9.7.0.7 (which seems to be the most recent one available on SourceForge).
>>>
>>> Saxon issue, or am I missing something?
>>>
>>> Thanks for reading.
>>> Regards, Emanuel
>>> --------------------------------------------------------------------
>>> -
>>> -
>>> -------- What NetFlow Analyzer can do for you? Monitors network
>>> bandwidth and traffic patterns at an interface-level. Reveals which
>>> users, apps, and protocols are consuming the most bandwidth.
>>> Provides multi-vendor support for NetFlow, J-Flow, sFlow and other
>>> flows. Make informed decisions using capacity planning reports.
>>> http://sdm.link/zohodev2dev
>>> _______________________________________________
>>> saxon-help mailing list archived at http://saxon.markmail.org/ 
>>> [hidden email]
>>> https://lists.sourceforge.net/lists/listinfo/saxon-help

Program - Copy.cs (11K) Download Attachment
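
For readers without the attachment, the chaining pattern described in the quoted message (try local catalogs first, fall back to the original resolver) might look roughly like the sketch below. This is not the poster's OurOwnResolver: the class name CatalogFirstResolver, the in-memory dictionary standing in for the TR-9401/XML catalog lookup, and the local file path are all assumptions made for illustration. It uses only System.Xml, so it does not depend on Saxon and does not show whether Saxon consults the resolver.

```csharp
using System;
using System.Collections.Generic;
using System.Net;
using System.Xml;

// Hypothetical stand-in for the poster's OurOwnResolver: maps known public or
// system identifiers to local URIs and delegates everything else to a fallback.
public sealed class CatalogFirstResolver : XmlResolver
{
    private readonly XmlResolver fallback;
    private readonly IDictionary<string, Uri> catalog;

    public CatalogFirstResolver(XmlResolver fallback, IDictionary<string, Uri> catalog)
    {
        if (fallback == null) throw new ArgumentNullException("fallback");
        if (catalog == null) throw new ArgumentNullException("catalog");
        this.fallback = fallback;
        this.catalog = catalog;
    }

    // Credentials are simply passed on to the wrapped resolver.
    public override ICredentials Credentials
    {
        set { fallback.Credentials = value; }
    }

    public override Uri ResolveUri(Uri baseUri, string relativeUri)
    {
        // Known identifier: redirect to the local copy without touching the network.
        Uri local;
        if (relativeUri != null && catalog.TryGetValue(relativeUri, out local))
        {
            return local;
        }
        // Unknown identifier: let the original resolver decide.
        return fallback.ResolveUri(baseUri, relativeUri);
    }

    public override object GetEntity(Uri absoluteUri, string role, Type ofObjectToReturn)
    {
        // ResolveUri has already redirected known identifiers, so loading is delegated.
        return fallback.GetEntity(absoluteUri, role, ofObjectToReturn);
    }
}

public static class ResolverDemo
{
    public static void Main()
    {
        // Hypothetical location of the local entity file.
        var localCopy = new Uri("file:///C:/catalogs/entities/fpi.ent");
        var catalog = new Dictionary<string, Uri>
        {
            { "-//SOME//ENTITIES Formal Public Identifier//EN//XML", localCopy },
            { "http://some.server.tld/entities/fpi", localCopy }
        };
        var resolver = new CatalogFirstResolver(new XmlUrlResolver(), catalog);

        // A catalogued system ID is redirected; an unknown one falls through unchanged.
        Console.WriteLine(resolver.ResolveUri(null, "http://some.server.tld/entities/fpi"));
        Console.WriteLine(resolver.ResolveUri(null, "http://example.org/other.dtd"));
    }
}
```

A resolver like this only helps if the parser actually asks it. In the stack trace quoted above, the entity is fetched by Xerces' XMLEntityManager via HttpURLConnection, which is consistent with the resolver never being consulted for the source document.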