
indentation of the preamble of an xml document

indentation of the preamble of an xml document

Wolfhart Totschnig-2
Hello,

I have a question about how Saxon 9he indents the preamble of an XML document. Let me explain the question with a minimal example.

When performing the following identity transformation:

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="2.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
    <xsl:output indent="yes" doctype-system="zettel.dtd"/>
    <xsl:template match="node()|@*">
        <xsl:copy>
            <xsl:apply-templates select="@*|node()"/>
        </xsl:copy>
    </xsl:template>
</xsl:stylesheet>

on the following xml document:

<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="zettel.xsl"?>
<!DOCTYPE zettel SYSTEM "zettel.dtd">
<zettel/>

Saxon 9he produces the following:

<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="zettel.xsl"?><!DOCTYPE zettel
  SYSTEM "zettel.dtd">
<zettel/>

That is, it puts no line break before "<!DOCTYPE" but does put a line break before "SYSTEM". Why is that? This way of indenting the preamble seems counterintuitive and ugly to me. Is this a bug? Or is there a reason for the behavior?

Thanks in advance for your help,
Wolfhart
_______________________________________________
saxon-help mailing list archived at http://saxon.markmail.org/
[hidden email]
https://lists.sourceforge.net/lists/listinfo/saxon-help 

Re: indentation of the preamble of an xml document

Wolfhart Totschnig-2
Hello,

There has not been a reply to my question. Is the question too petty? I know that the issue has no real importance, but I would still be curious to know why Saxon formats the preamble of an XML document the way it does. I would be grateful for a reply.

Wolfhart


On 03/28/2016 02:21 PM, Wolfhart Totschnig wrote:
> [...]

Re: indentation of the preamble of an xml document

Michael Kay
It's a surprisingly difficult question to answer, and the reasons are a mixture of design and accident.

The first thing I asked myself was, why is the DOCTYPE declaration output AFTER the xml-stylesheet processing instruction?

Simple answer: because the spec says so. XSLT2/XQuery1 serialization section 5.1.6: "If the doctype-system parameter is specified, the XML output method MUST output a document type declaration immediately before the first element." But as to why it says so, I'm really not sure. Clearly the generated XML would still be well-formed if the DOCTYPE declaration came first. The xml-stylesheet specification allows the PI to come either before or after the DOCTYPE declaration.

For indent=yes, the serialization spec says "Whitespace characters MUST NOT be added other than adjacent to an element node, that is, immediately before a start tag or immediately after an end tag." This means, for example, that whitespace cannot be added between two processing instructions.

This rule probably accounts for the fact that there is no whitespace between the processing instruction and the DOCTYPE declaration; the indentation rules don't allow whitespace to be inserted here. On a strict reading of the spec, indentation whitespace is permitted between the DOCTYPE declaration and the first element, but it is not permitted before the DOCTYPE declaration.

Is there a good reason for this rule in the serialization spec? No, I suspect not. It could have allowed whitespace to be added in this location. But getting the rules right was difficult, and the authors were cautious.

Can whitespace be added before the DOCTYPE declaration on some other basis? Perhaps, on a strict reading, no. Even though such whitespace would be ignored by the XML parser and would not affect the data model it constructs, the serialization spec does not give license to insert it. But there are other cases where Saxon adds such whitespace, for example between an XML declaration and a DOCTYPE declaration.

There are two reasons for adopting a very strict interpretation of the serialization spec. One is that some test suites verify test results by lexical character-by-character comparison of the generated XML. In the current XSLT and XQuery test suites we have moved away from this towards XPath-based test result assertions, but with earlier test suites, it was easier to pass the tests if you stuck very closely to the serialization rules. The second reason for strictness is that the entity being generated might be an external general parsed entity rather than a document entity, and we don't actually know which is intended. In an EGPE, whitespace between the XML declaration (technically, the text declaration) and the first start tag is significant. An EGPE will never have a DOCTYPE declaration, of course, so this rule doesn't prevent generation of whitespace adjacent to the DOCTYPE declaration, but the code is written cautiously because of this possibility.

The XMLEmitter (implementing the XML serialization method in Saxon), immediately before outputting "<!DOCTYPE", has the logic:

if (declarationIsWritten && !indenting) {
    // don't add a newline if indenting, because the indenter will already have done so
    writer.write("\n");
}

The effect of this code is that when indenting is off, Saxon does output a newline before the DOCTYPE declaration provided that an XML declaration has been written. The comment indicates that the author of this code thought that the indenter (which operates on the stream of events generated by the transformation before they are passed to the XMLEmitter) will have inserted a newline before the first element, and a second newline is not needed. In fact, in this particular example, this is not the case: the serializer inserts a newline before a start tag only if the previous thing output (not counting whitespace) was a start tag or end tag.

So the explanation here is perhaps that the code structure divides responsibility between the indenter and the emitter, and on this occasion neither has inserted a newline, because both thought the other would do so.
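
To make that division of responsibility concrete, here is a minimal sketch in Java. It is not Saxon's actual code: the class, enum, and method names are hypothetical, and the two rules are simplified restatements of the emitter condition quoted above and of the indenter behaviour described in the previous paragraph.

```java
public class DoctypeNewlineSketch {
    /** Kind of item most recently written, ignoring whitespace (simplified, hypothetical). */
    enum Previous { XML_DECLARATION, PROCESSING_INSTRUCTION, START_TAG, END_TAG }

    /** Emitter side: same condition as the quoted fragment. */
    static boolean emitterAddsNewline(boolean declarationIsWritten, boolean indenting) {
        return declarationIsWritten && !indenting;
    }

    /** Indenter side: a newline goes before a start tag only after another tag. */
    static boolean indenterAddsNewline(boolean indenting, Previous previous) {
        return indenting
                && (previous == Previous.START_TAG || previous == Previous.END_TAG);
    }

    /** A newline appears before the DOCTYPE only if one of the two components adds it. */
    static boolean newlineBeforeDoctype(boolean declarationIsWritten,
                                        boolean indenting,
                                        Previous previous) {
        return emitterAddsNewline(declarationIsWritten, indenting)
                || indenterAddsNewline(indenting, previous);
    }

    public static void main(String[] args) {
        // The case from this thread: indent="yes", and a processing
        // instruction precedes the first element.
        System.out.println(
                newlineBeforeDoctype(true, true, Previous.PROCESSING_INSTRUCTION)); // false
        // Same document with indenting switched off.
        System.out.println(
                newlineBeforeDoctype(true, false, Previous.PROCESSING_INSTRUCTION)); // true
    }
}
```

With indenting on and a processing instruction as the previous item, both rules decline to add the newline, which matches the output reported in the original question.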

In all of this the policy is (a) conform to the spec, (b) do no harm, and (c) keep the rules simple. Making the output visually attractive is a nice-to-have. So we could improve it for this case, but I'm reluctant to take the risk. We don't have enough serialization tests to make such changes risk-free, and when we do make such changes, a lot of serialization tests break and have to be rewritten.

The newline before "SYSTEM" and "PUBLIC" is completely safe, and it will improve the appearance in some cases but not in others: a lot depends on (a) how long the system id and public id are, and (b) on the width of your display window.
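
The reason it is safe is that the XML grammar requires whitespace between the name and the external identifier in a DOCTYPE declaration anyway, and a newline is as valid there as a space. The following is a small illustrative sketch in Java, not Saxon's code: the class and method names are hypothetical, the SYSTEM layout copies the output shown in the original question, and the PUBLIC layout is an assumption made by analogy. The ids are assumed to contain no double quotes.

```java
public class DoctypeLayoutSketch {
    /**
     * Builds a DOCTYPE declaration with a newline and two spaces before the
     * SYSTEM or PUBLIC keyword. Pass null for publicId to get the SYSTEM form.
     */
    static String doctype(String rootName, String publicId, String systemId) {
        StringBuilder sb = new StringBuilder("<!DOCTYPE ").append(rootName);
        sb.append("\n  "); // whitespace is required here; a newline is permitted
        if (publicId != null) {
            sb.append("PUBLIC \"").append(publicId).append("\" ");
        } else {
            sb.append("SYSTEM ");
        }
        sb.append('"').append(systemId).append("\">");
        return sb.toString();
    }

    public static void main(String[] args) {
        // Prints the two-line declaration from the original question.
        System.out.println(doctype("zettel", null, "zettel.dtd"));
    }
}
```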

Michael Kay
Saxonica

> On 31 Mar 2016, at 20:26, Wolfhart Totschnig <[hidden email]> wrote:
> [...]

Re: indentation of the preamble of an xml document

Wolfhart Totschnig-3
Thank you, Michael, for the clear and detailed reply!
Wolfhart


On 03/31/2016 06:43 PM, Michael Kay wrote:

> [...]