How to get Saxon to stop putting newlines after CDATA tags?

Doucet, Paul

Hi:

 

I am trying to use Saxon HE 9.6 and XSLT to transform some XML documents that contain CDATA sections.  I want these preserved “as is” in the output document.  But Saxon is adding a newline character after the opening CDATA tag, before the actual start of the CDATA content, and another one before the CDATA closing tag.  The results screw up the importer of the transformed document.

 

Here is my XSLT:

 

<?xml version="1.0" encoding="utf-8"?>

 

<xsl:stylesheet version="2.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

 

  <!-- we want the contents of any RTF or TEXT to be formatted as CDATA. -->

 

  <xsl:output method="xml" indent="no" encoding="utf-8" cdata-section-elements="RTF TEXT"/>

 

  <xsl:template match="@*|node()">

      <xsl:copy>

        <xsl:apply-templates select="@*|node()" />

      </xsl:copy>

  </xsl:template>

</xsl:stylesheet>

 

This is my input node:

 

              <Command name="review of systems" group="" enabled="true" states="">

                     <description></description>

                     <contents type="TEXT-GRAPHICS">

                           <RTF>

<![CDATA[{\rtf1\ansi\ansicpg1252\deff0\deflang1033{\fonttbl{\f0\fnil\fcharset0 MS Sans Serif;}}

\viewkind4\uc1\pard\f0\fs17 Pt denies fevers, chills, nausea, vomiting, or sick contacts.

}]]>

                           </RTF>

                           <TEXT>

<![CDATA[Pt denies fevers, chills, nausea, vomiting, or sick contacts.]]>

                           </TEXT>

                     </contents>

              </Command>

 

This is what I get for output:

 

<?xml version="1.0" encoding="utf-8"?><MyCommands version="2.0" language="0x409"><Commands type="global"><Command name="review of systems" group="" enabled="true" states=""><description/><contents type="TEXT-GRAPHICS">

                           <RTF><![CDATA[

{\rtf1\ansi\ansicpg1252\deff0\deflang1033{\fonttbl{\f0\fnil\fcharset0 MS Sans Serif;}}

\viewkind4\uc1\pard\f0\fs17 Pt denies fevers, chills, nausea, vomiting, or sick contacts.

}

                           ]]></RTF>

                           <TEXT><![CDATA[

Pt denies fevers, chills, nausea, vomiting, or sick contacts.

                           ]]></TEXT>

                     </contents></Command></Commands></MyCommands>

 

I’m new to XSLT and could use some expert advice.  I thought whatever followed the CDATA tag was not supposed to be modified in any way.

 

Thanks,

 

-Paul D.

 


------------------------------------------------------------------------------
Check out the vibrant tech community on one of the world's most
engaging tech sites, SlashDot.org! http://sdm.link/slashdot
_______________________________________________
saxon-help mailing list archived at http://saxon.markmail.org/
[hidden email]
https://lists.sourceforge.net/lists/listinfo/saxon-help 

Re: How to get Saxon to stop putting newlines after CDATA tags?

Martin Honnen-2
On 22.02.2017 18:07, Doucet, Paul wrote:

> I am trying to use saxon HE 9.6 and XSLT to transform some XML doc that
> contain CDATA sections.  I want these preserved “as is” in the output
> doc.  But Saxon is adding newline characters after the CDATA tag before
> the actual start of the CDATA and another one before the CDATA closing
> tag.  The results screw up the importer of the transformed document.

> I’m new XSLT and could use some expert advice.  I thought whatever
> followed the CDATA tag was not supposed to be modified in any way.

What happens is that the XSLT processor uses an XML parser to build a
tree of nodes where any CDATA section is handled by the XML parser and
the XSLT processor works with an "RTF" element node containing a text
child node without even knowing or being able to establish that there
was a CDATA section in the original lexical markup. The XSLT processor
then processes the XSLT code and creates a result tree with a copy of
that RTF element and when serializing the result tree honors your
request to output all the contents of the RTF element as a single CDATA
section. So that way what was an RTF element with some white space
followed by a CDATA section followed by some white space in the input
becomes an RTF element in the output with a single CDATA section
containing the white space plus the text plus the white space.

There is no easy way around this with any XSLT processor as the
XSLT/XPath data model does not distinguish normal text from CDATA
section text, instead these are collapsed into a single text node.
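
The merging described above can be seen with any XML parser, not only the one Saxon uses. As a minimal sketch, here is Python's standard xml.etree.ElementTree parsing a shortened version of the RTF element from the question (the shortened content and the variable names are made up for illustration). The parser hands back one string containing the leading newline, the CDATA content, and the trailing newline, with no record of where the CDATA section began or ended.

```python
import xml.etree.ElementTree as ET

# Shortened stand-in for the RTF element in the question:
# newline, CDATA section, newline.
source = "<RTF>\n<![CDATA[{\\rtf1 Pt denies fevers.}]]>\n</RTF>"

rtf = ET.fromstring(source)

# The element has no child nodes other than text, and that text is the
# surrounding whitespace and the CDATA content joined together.
print(len(rtf))        # number of child elements
print(repr(rtf.text))  # the merged text content
```

Serializing that text as a single CDATA section, which is what cdata-section-elements asks for, necessarily puts the newlines inside the section.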

So you would need to preprocess the XML to convert the CDATA section
into XML markup an XSLT processor could distinguish, if you want to use
XSLT on your input.

Andrew Welch's LexEv tool allows that. http://andrewjwelch.com/lexev/


Re: How to get Saxon to stop putting newlines after CDATA tags?

Michael Kay
In reply to this post by Doucet, Paul
Firstly, Saxon can only preserve things that it's aware of, and the only thing it's aware of is the XDM data model (not the original lexical XML). In the XDM data model, the CDATA sections have been merged with adjacent text nodes. So the whitespace before and after the CDATA is treated exactly the same by XSLT as if it was within the CDATA.

In your xsl:output, when you say cdata-section-elements="RTF TEXT", that's asking for the entire contents of these elements to be wrapped in CDATA - which is what is happening.

The design of XSLT (and in particular, of XDM) is based on the assumption that CDATA has no information-bearing significance. That is, the following constructs are precisely equivalent:

(a) <z>  </z>

(b) <z><![CDATA[  ]]></z>

(c) <z>&#x20;&#x20;</z>

Putting stuff in CDATA is just a way to avoid having to escape special characters like "<" and "&".
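
The equivalence of (a), (b) and (c) is easy to check outside XSLT as well. The following is only an illustrative sketch using Python's standard xml.etree.ElementTree rather than Saxon; the dictionary and variable names are invented for the example. All three documents parse to a z element whose text is the same two spaces.

```python
import xml.etree.ElementTree as ET

# The three constructs from the list above.
forms = {
    "a": "<z>  </z>",
    "b": "<z><![CDATA[  ]]></z>",
    "c": "<z>&#x20;&#x20;</z>",
}

# Parse each one and keep only the text content of <z>.
texts = {label: ET.fromstring(xml).text for label, xml in forms.items()}

for label, text in texts.items():
    print(label, repr(text))

# After parsing, nothing distinguishes the three forms.
print(len(set(texts.values())) == 1)
```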

Now, if your receiving application is failing when the CDATA section boundaries are in the wrong place, then it looks as if that application is not following this convention: it's using CDATA in a way that conveys information, just like element markup does. That basically makes it incompatible with XSLT.

The workaround is to convert the CDATA markup to element markup before XSLT gets to see it. For example, you can do this with a tool such as sed: simply replace <![CDATA[ by <cdata>, and ]]> by </cdata>, everywhere that it appears. Alternatively, there's a tool called lexev from Andrew Welch which does the same thing but as a filter between the XML parser and the XSLT processor.
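
As a rough sketch of that workaround, here is the same textual replacement written in Python instead of sed. The function names are hypothetical, and the code assumes that no element called cdata already exists in the input and that the text contains no "]]>" sequence. Unlike a plain search-and-replace, it also escapes "<" and "&" inside the former CDATA content so that the intermediate document stays well-formed.

```python
import re
from xml.sax.saxutils import escape, unescape

# Matches one CDATA section; DOTALL lets the content span several lines.
CDATA_RE = re.compile(r"<!\[CDATA\[(.*?)\]\]>", re.DOTALL)
# Matches the marker element (a name assumed for this sketch).
MARKER_RE = re.compile(r"<cdata>(.*?)</cdata>", re.DOTALL)


def cdata_to_elements(xml_text):
    """Before the transformation: turn each CDATA section into a <cdata> element."""
    return CDATA_RE.sub(
        lambda m: "<cdata>" + escape(m.group(1)) + "</cdata>", xml_text
    )


def elements_to_cdata(xml_text):
    """After the transformation: turn each <cdata> element back into a CDATA section."""
    return MARKER_RE.sub(
        lambda m: "<![CDATA[" + unescape(m.group(1)) + "]]>", xml_text
    )


if __name__ == "__main__":
    original = "<TEXT>\n<![CDATA[Temp < 38 & no chills.]]>\n</TEXT>"
    marked = cdata_to_elements(original)
    print(marked)
    print(elements_to_cdata(marked) == original)
```

With this approach the stylesheet would presumably drop cdata-section-elements="RTF TEXT", since the CDATA sections are restored by the second step rather than by the serializer. The sketch does not handle numeric character references or an empty marker element serialized as <cdata/>, so treat it as a starting point only.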

Michael Kay
Saxonica


Re: [EXTERNAL] Re: How to get Saxon to stop putting newlines after CDATA tags?

Doucet, Paul

Thanks for the prompt reply.

 

From: Michael Kay [mailto:[hidden email]]
Sent: Wednesday, February 22, 2017 12:50 PM
To: Mailing list for the SAXON XSLT and XQuery processor
Subject: [EXTERNAL] Re: [saxon] How to get Saxon to stop putting newlines after CDATA tags?

 
