
Illegal XML character produced by Saxon XSLT: &#x15;


Gueugaie
Hi list,

Using Java 8 (1.8.0_65) and Saxon HE 9.7.0-1, I see strange behavior when reading back XML output by Saxon, specifically when text nodes contain certain special characters, such as \u0015.

Admittedly, there is little point in serializing such special characters, but I would expect the Transformer to always produce well-formed XML output, whatever you throw at it. Here, both Woodstox (XMLStreamReader) and the default JDK parsers (XMLStreamReader, or DOM using SAX) reject the output.

Here is some sample code that demonstrates the behavior:

First, create a dummy document using the DOM API (here a single-element document), and serialize it to a String using Saxon's identity transformer.

    Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
    Element root = doc.createElement("root");
    root.setTextContent("Test \u0015");
    doc.appendChild(root);
    Transformer transformer = TransformerFactoryUtil.newSpecificTransformerFactory("net.sf.saxon.TransformerFactoryImpl").newTransformer();
    transformer.setOutputProperty(OutputKeys.METHOD, "xml");
    StringWriter output = new StringWriter();
    transformer.transform(new DOMSource(doc), new StreamResult(output));

Which gives this document:
<?xml version="1.0" encoding="UTF-8"?><root>Test &#x15;</root>


Then, no matter how I try to parse the document back, I get an exception. Using Woodstox:
    try (Reader stringReader = new StringReader(output.toString())) {
      System.setProperty("javax.xml.stream.XMLInputFactory", WstxInputFactory.class.getName());
      XMLStreamReader r = XMLInputFactory.newFactory().createXMLStreamReader(stringReader);
      String typeOfReader = r.getClass().getName();
      try {
        while (r.hasNext()) {
          r.next();
        }
        System.out.println("done");
      } catch (Exception e) {
        e.printStackTrace();
        System.out.println("Failed with parser of type " + typeOfReader);
      }
    }

I get: com.ctc.wstx.exc.WstxParsingException: Illegal character entity: expansion character (code 0x15 at [row,col {unknown-source}]: [1,55]

Using the default JDK parser (same code, with the property reset as follows):
      System.setProperty("javax.xml.stream.XMLInputFactory", "com.sun.xml.internal.stream.XMLInputFactoryImpl");

This gives: javax.xml.stream.XMLStreamException: ParseError at [row,col]:[1,56] Message: Character reference "&#

And using the DOM parser:
    try {
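      // Encode as UTF-8 to match the encoding named in the XML declaration, rather than the platform default.
      DocumentBuilderFactory.newInstance().newDocumentBuilder().parse(new ByteArrayInputStream(output.toString().getBytes(StandardCharsets.UTF_8)));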
      System.out.println("done using DOM");
    } catch (Exception e) {
      e.printStackTrace();
      System.out.println("Failed with parser of type DOM");
    }

I also get: [Fatal Error] :1:56: Character reference "&#x15" is an invalid XML character.
org.xml.sax.SAXParseException; lineNumber: 1; columnNumber: 56; Character reference "&#x15" is an invalid XML character.

I'm no specialist in XML escaping and entities, but I guess Saxon is somehow producing an invalid document. Is this expected behavior?
Is there some trick that would help produce a valid document?

Thanks,
Guillaume


_______________________________________________
saxon-help mailing list archived at http://saxon.markmail.org/
[hidden email]
https://lists.sourceforge.net/lists/listinfo/saxon-help 
Re: Illegal XML character produced by Saxon XSLT: &#x15;

Michael Kay
Thanks for reporting it.

The character x15 is legal in XML 1.1 but not in XML 1.0. It is valid if:

(a) it is written as &#x15;
(b) and the XML declaration specifies version="1.1"
(c) and the parser is an XML 1.1 parser.

The third condition is basically your responsibility - you get to choose which parser to use (the default parser in the JDK will accept XML 1.1). But the failure here is on (b).

You should be able to set the XML declaration correctly using

transformer.setOutputProperty(OutputKeys.VERSION, "1.1");
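
Conditions (a) to (c) can be checked in isolation, without Saxon, by handing a parser two hand-written documents that differ only in the declared version. What follows is a minimal sketch rather than anything taken from Saxon: the class name `XmlVersionCheck` and the method `parses` are made up for illustration, and it assumes that `DocumentBuilderFactory.newInstance()` resolves to the JDK's built-in parser.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

public class XmlVersionCheck {

    /** Returns true if the parser accepts the given document as well-formed. */
    public static boolean parses(String xml) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // DefaultHandler rethrows fatal errors without printing "[Fatal Error]" to the console.
            builder.setErrorHandler(new DefaultHandler());
            builder.parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            return true;
        } catch (SAXException e) {
            // The parser rejected the document (for example, an illegal character reference).
            return false;
        } catch (IOException | ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String body = "<root>Test &#x15;</root>";
        // Same content, different declared version.
        String v10 = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>" + body;
        String v11 = "<?xml version=\"1.1\" encoding=\"UTF-8\"?>" + body;
        System.out.println("declared 1.0 parses: " + parses(v10)); // false with the JDK parser
        System.out.println("declared 1.1 parses: " + parses(v11)); // true with the JDK parser
    }
}
```

The first document is the one from the original post, so its rejection matches the reported SAXParseException. The second is what the serializer should emit once the version output property is set to 1.1.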

That leaves the question of whether Saxon could or should do anything differently.

When input comes from a DOM, sadly, there is very little guarantee that the data follows any rules about consistency and well-formedness. This is one of the biggest weaknesses of DOM. We have no way of asking the DOM whether it's an XML 1.0 DOM or an XML 1.1 DOM. Checking every character that it contains (and doing all the other integrity checks that would be needed to guarantee that we never produce bad output) would be seriously expensive. So our general policy with DOM is (a) encourage users not to use it, and (b) if they must use it, assume that the data found in the DOM is sound, without further validation.

So I think that the answer is: if you provide us with input in the form of a DOM, and it contains XML 1.1 characters, it's your responsibility to configure Saxon and its serializer to emit valid XML 1.1.
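
On the application side, one way to take on that responsibility is to scan the DOM before handing it to the transformer and choose the output version accordingly. The following is only a sketch under assumptions: `DomVersionProbe`, `hasCharsIllegalInXml10` and `singleElementDoc` are hypothetical names, not part of Saxon or JAXP, and the scan is exactly the kind of full character check described above as too expensive for Saxon to do by default.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.Node;

public class DomVersionProbe {

    /** True if the code point falls outside the XML 1.0 Char production. */
    static boolean illegalInXml10(int cp) {
        boolean legal = cp == 0x9 || cp == 0xA || cp == 0xD
                || (cp >= 0x20 && cp <= 0xD7FF)
                || (cp >= 0xE000 && cp <= 0xFFFD)
                || (cp >= 0x10000 && cp <= 0x10FFFF);
        return !legal;
    }

    /** Recursively checks node values (text, comments, attribute values) below the given node. */
    public static boolean hasCharsIllegalInXml10(Node node) {
        String value = node.getNodeValue(); // null for element and document nodes
        if (value != null && value.codePoints().anyMatch(DomVersionProbe::illegalInXml10)) {
            return true;
        }
        NamedNodeMap attributes = node.getAttributes(); // null unless the node is an element
        if (attributes != null) {
            for (int i = 0; i < attributes.getLength(); i++) {
                if (hasCharsIllegalInXml10(attributes.item(i))) {
                    return true;
                }
            }
        }
        for (Node child = node.getFirstChild(); child != null; child = child.getNextSibling()) {
            if (hasCharsIllegalInXml10(child)) {
                return true;
            }
        }
        return false;
    }

    /** Builds the same kind of single-element document as in the original post. */
    public static Document singleElementDoc(String text) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
            Element root = doc.createElement("root");
            root.setTextContent(text);
            doc.appendChild(root);
            return doc;
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        Document doc = singleElementDoc("Test \u0015");
        // Note: a few characters (such as U+0000) are illegal in XML 1.1 as well;
        // this sketch does not distinguish that case.
        String version = hasCharsIllegalInXml10(doc) ? "1.1" : "1.0";
        System.out.println("serialize as XML " + version);
    }
}
```

The chosen value would then be passed to `transformer.setOutputProperty(OutputKeys.VERSION, version)` before calling `transform`.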

Michael Kay
Saxonica

> On 19 Feb 2016, at 09:40, Gueugaie <[hidden email]> wrote:
> [...]



Re: Illegal XML character produced by Saxon XSLT: &#x15;

Ludovic Kuty
In reply to this post by Gueugaie
Hi,

XML 1.0 characters have to match the Char production in the XML 1.0 spec. The
code point 0x15 is not included in any of its ranges.

https://www.w3.org/TR/REC-xml/#NT-Char
Character Range
[2]    Char    ::=    #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
       /* any Unicode character, excluding the surrogate blocks, FFFE, and FFFF. */
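
As a quick illustration, the production translates directly into a range check. This is a sketch with made-up names (`XmlChars`, `isXml10Char`, `isXml11Char`); the XML 1.1 variant is not quoted in this thread and is taken from the Char production of the XML 1.1 Recommendation, `[#x1-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]`.

```java
public class XmlChars {

    /** Char production of XML 1.0, as quoted above. */
    public static boolean isXml10Char(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
                || (cp >= 0x20 && cp <= 0xD7FF)
                || (cp >= 0xE000 && cp <= 0xFFFD)
                || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    /** Char production of XML 1.1: every code point from #x1 up, minus surrogates, FFFE and FFFF. */
    public static boolean isXml11Char(int cp) {
        return (cp >= 0x1 && cp <= 0xD7FF)
                || (cp >= 0xE000 && cp <= 0xFFFD)
                || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    public static void main(String[] args) {
        System.out.println(isXml10Char(0x15)); // false
        System.out.println(isXml11Char(0x15)); // true
    }
}
```

Being allowed by the 1.1 production is not the whole story: XML 1.1 also requires control characters such as 0x15 to appear as character references (like &#x15;) rather than literally.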

Ludovic Kuty



Re: Illegal XML character produced by Saxon XSLT: &#x15;

Gueugaie
Thanks, all, for these details and for pointing out this 1.0 vs 1.1 difference, among other things.
I can confirm that specifying version 1.1 at the transformer level works and makes the document readable again.

I'll try to see how I can get away from DOM in my code, or switch to 1.1, now that I have been pointed in the right direction.

Thanks



2016-02-19 11:08 GMT+01:00 Ludovic Kuty <[hidden email]>:
> [...]

