Does SAXON provide a setting to inform its XML parser to remove CDATA sections?

Costello, Roger L.
Hello Michael,

This XML document has a CDATA section:

<book>
    <title>Parsing Techniques</title>
    <author>Dick Grune</author>
    <author><![CDATA[blah, blah]]></author>
</book>

By the time that XML document gets to the XSLT processor the XML parser has removed the CDATA wrapper. So the XSLT processor sees this:

<book>
    <title>Parsing Techniques</title>
    <author>Dick Grune</author>
    <author>blah, blah</author>
</book>

Does SAXON provide a setting to instruct the XML parser that SAXON uses:

        Hey, if you encounter a CDATA section as you parse
        the XML document, please remove the CDATA section
        and its content.

Thus, when the XSLT processor gets the XML document, the XSLT processor sees an empty author element:

<book>
    <title>Parsing Techniques</title>
    <author>Dick Grune</author>
    <author></author>
</book>

Does such a CDATA-removal-setting exist in SAXON?

/Roger

_______________________________________________
saxon-help mailing list archived at http://saxon.markmail.org/
[hidden email]
https://lists.sourceforge.net/lists/listinfo/saxon-help 
Re: Does SAXON provide a setting to inform its XML parser to remove CDATA sections?

Michael Kay
You can achieve this by writing a SAX filter to sit between the parser and Saxon.

There’s an example here:

http://www.ibm.com/developerworks/library/x-tipsaxfilter/

When you supply input to Saxon in the form of a SAXSource, the SAXSource contains an XMLReader.

SAX provides the class XMLFilterImpl, which looks to the XML parser like a ContentHandler and to the application (Saxon) like an XML parser. So if your SAXSource contains an XMLFilter (or XMLFilterImpl), Saxon treats it as if it were the parser, when in fact it is filtering events from the parser. In this particular case the filter would drop any text (characters() event) that arrives after a startCDATA() event and before the corresponding endCDATA().

Actually, it’s slightly complicated by the fact that in SAX, CDATA start/end events are notified to the LexicalHandler and not to the ContentHandler. Saxon registers itself with the XMLReader to handle both sets of events, but XMLFilterImpl does not implement LexicalHandler, so it never sees the lexical events. You could get around this by taking the source code of XMLFilterImpl and extending it so it handles LexicalHandler events as well as ContentHandler events. Or you could find the source of Andrew Welch’s LexEv utility, which presumably does something similar. (LexEv, however, passes MORE information to Saxon, e.g. details of CDATA boundaries, whereas you want to pass LESS information.)
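
For illustration, the filter described above could be sketched roughly as follows. This is a minimal sketch, not code from Saxon or LexEv. The class name CdataStrippingFilter and the helper stripCdata are hypothetical. It assumes the consumer registers its lexical handler through the standard SAX property http://xml.org/sax/properties/lexical-handler, and it uses the JDK's identity Transformer as a stand-in for Saxon so that it runs without extra libraries.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.SAXParserFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.sax.SAXSource;
import javax.xml.transform.stream.StreamResult;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXNotRecognizedException;
import org.xml.sax.SAXNotSupportedException;
import org.xml.sax.XMLReader;
import org.xml.sax.ext.LexicalHandler;
import org.xml.sax.helpers.XMLFilterImpl;

// Hypothetical filter: drops the text content of every CDATA section.
class CdataStrippingFilter extends XMLFilterImpl implements LexicalHandler {
    private static final String LEXICAL_HANDLER =
            "http://xml.org/sax/properties/lexical-handler";

    private LexicalHandler downstream; // the consumer's lexical handler, if any
    private boolean inCdata;           // true between startCDATA() and endCDATA()

    CdataStrippingFilter(XMLReader parent) {
        super(parent);
    }

    // Intercept the consumer's lexical handler instead of passing it to the parser.
    @Override
    public void setProperty(String name, Object value)
            throws SAXNotRecognizedException, SAXNotSupportedException {
        if (LEXICAL_HANDLER.equals(name)) {
            downstream = (LexicalHandler) value;
        } else {
            super.setProperty(name, value);
        }
    }

    // Register this filter as the parser's lexical handler before parsing.
    @Override
    public void parse(InputSource input) throws SAXException, IOException {
        getParent().setProperty(LEXICAL_HANDLER, this);
        super.parse(input);
    }

    // Suppress character data that arrives inside a CDATA section.
    @Override
    public void characters(char[] ch, int start, int length) throws SAXException {
        if (!inCdata) {
            super.characters(ch, start, length);
        }
    }

    // CDATA boundaries are tracked here and not forwarded downstream.
    @Override public void startCDATA() { inCdata = true; }
    @Override public void endCDATA() { inCdata = false; }

    // All other lexical events are forwarded unchanged.
    @Override public void startDTD(String name, String publicId, String systemId)
            throws SAXException {
        if (downstream != null) downstream.startDTD(name, publicId, systemId);
    }
    @Override public void endDTD() throws SAXException {
        if (downstream != null) downstream.endDTD();
    }
    @Override public void startEntity(String name) throws SAXException {
        if (downstream != null) downstream.startEntity(name);
    }
    @Override public void endEntity(String name) throws SAXException {
        if (downstream != null) downstream.endEntity(name);
    }
    @Override public void comment(char[] ch, int start, int length) throws SAXException {
        if (downstream != null) downstream.comment(ch, start, length);
    }

    // Hypothetical helper: parse xml through the filter and serialize the result.
    static String stripCdata(String xml) throws Exception {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true);
        XMLReader parser = factory.newSAXParser().getXMLReader();

        SAXSource source = new SAXSource(new CdataStrippingFilter(parser),
                new InputSource(new StringReader(xml)));

        // Identity transform from the JDK; a Saxon transformer would take the same source.
        Transformer identity = TransformerFactory.newInstance().newTransformer();
        identity.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        StringWriter out = new StringWriter();
        identity.transform(source, new StreamResult(out));
        return out.toString();
    }
}
```

Calling CdataStrippingFilter.stripCdata on the book document should yield a second author element with no text content. Overriding setProperty matters because XMLFilterImpl otherwise hands the lexical-handler property straight to the underlying parser, so the CDATA events would bypass the filter.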

Michael Kay
Saxonica


> On 25 Jun 2015, at 13:30, Costello, Roger L. <[hidden email]> wrote: