Integrating an XML/HTML Tidy function into a Saxon XSLT 3.0 stylesheet?

Sewell, David R. (drs2n)
I am processing files in ONIX XML (a book/journal publishing language, see
http://www.editeur.org/), and need to be able to cope with buggy escaped HTML.
For example, a book description element might come to me like so:

<d104 refname="Text">&lt;P&gt;Tolstoy's &lt;I&gt;War and Peace&lt;/i&gt; is a great book. &lt;P&gt;You should buy it.&lt;/P&gt;</d104>

Unescaped, that would yield

<d104 refname="Text"><P>Tolstoy's <I>War and Peace</i> is a great book. <P>You should buy it.</P></d104>

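which is obviously not a well-formed XML fragment: the <I> start tag is closed
by </i> (XML names are case-sensitive), and the first <P> is never closed.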

I have a working preprocessing script in XSLT 3.0 that uses try/catch with
saxon:parse() to determine whether the string content of a d104 element is
parseable as XML. If there were such a thing as a saxon:tidy() extension function
I could call it inside xsl:catch, but there isn't.

Any thoughts on the likeliest strategy for accomplishing what I'm trying to do?
A user-defined extension function that would pass the string to an external
instance of HTML tidy and operate on the result? Or...?

(Unfortunately I'm near-illiterate in Java, so cobbling together something
original is not an option.)

David S.

[Just as a PS to anyone on the list who might be familiar with ONIX XML: I know
that actual XHTML can be used in an ONIX Text element, but getting valid XHTML
into our Press's ONIX XML would first require someone to hand-fix about 500
book entries.]

--
David Sewell, Editorial and Technical Manager
ROTUNDA, The University of Virginia Press
PO Box 400318, Charlottesville, VA 22904-4314 USA
Email: [hidden email]   Tel: +1 434 924 9973
Web: http://rotunda.upress.virginia.edu/

------------------------------------------------------------------------------
_______________________________________________
saxon-help mailing list archived at http://saxon.markmail.org/
[hidden email]
https://lists.sourceforge.net/lists/listinfo/saxon-help 

Re: Integrating an XML/HTML Tidy function into a Saxon XSLT 3.0 stylesheet?

Jirka Kosek
On 22.9.2015 18:04, David Sewell wrote:
> Any thoughts on the likeliest strategy for accomplishing what I'm trying to do?
> A user-defined extension function that would pass the string to an external
> instance of HTML tidy and operate on the result? Or...?

This is already supported: there is a saxon:parse-html() function, which
uses the TagSoup parser:

http://www.saxonica.com/documentation9.5/functions/saxon/parse-html.html

Personally, I have used the validator.nu HTML5 parser on several projects for
similar purposes.
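
For the archive, here is a minimal sketch of how that fallback might be wired into an XSLT 3.0 stylesheet. It is untested against real ONIX data and rests on several assumptions that are not from the original question: it uses the standard parse-xml-fragment() function instead of saxon:parse() for the well-formedness test, it assumes the TagSoup jar is on the classpath of a Saxon edition that provides saxon:parse-html(), and it assumes the tidied markup comes back wrapped in html/body elements (matched here with namespace wildcards).

```xslt
<xsl:stylesheet version="3.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:saxon="http://saxon.sf.net/"
    exclude-result-prefixes="saxon">

  <!-- Copy everything that is not handled explicitly below. -->
  <xsl:mode on-no-match="shallow-copy"/>

  <!-- d104 holds escaped HTML; its string value is the unescaped markup. -->
  <xsl:template match="d104">
    <xsl:copy>
      <xsl:copy-of select="@*"/>
      <xsl:try>
        <!-- First attempt: the content is already well-formed XML. -->
        <xsl:copy-of select="parse-xml-fragment(string(.))"/>
        <xsl:catch>
          <!-- Fallback: let TagSoup repair the tag soup.
               Assumption: the result is an html/body wrapper, so only
               the children of body are copied into d104. -->
          <xsl:copy-of
              select="saxon:parse-html(string(.))/*:html/*:body/node()"/>
        </xsl:catch>
      </xsl:try>
    </xsl:copy>
  </xsl:template>

</xsl:stylesheet>
```

Content that parses cleanly is copied exactly as written, so element-name case and namespaces may differ between the two branches; a later normalization pass may be needed if downstream code expects one consistent form.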

                        Jirka

--
------------------------------------------------------------------
  Jirka Kosek      e-mail: [hidden email]      http://xmlguru.cz
------------------------------------------------------------------
     Professional XML and Web consulting and training services
DocBook/DITA customization, custom XSLT/XSL-FO document processing
------------------------------------------------------------------
 OASIS DocBook TC member, W3C Invited Expert, ISO JTC1/SC34 rep.
------------------------------------------------------------------
    Bringing you XML Prague conference    http://xmlprague.cz
------------------------------------------------------------------



Re: Integrating an XML/HTML Tidy function into a Saxon XSLT 3.0 stylesheet?

Sewell, David R. (drs2n)
Ah, perfect! I hadn’t realized that saxon:parse-html() did tag cleanup as well. Thanks very much! I just downloaded the TagSoup jarfile and it does exactly what I want.

David




Re: Integrating an XML/HTML Tidy function into a Saxon XSLT 3.0 stylesheet?

Dave Pawson-2
In reply to this post by Sewell, David R. (drs2n)
As Jirka says, TagSoup is great for this.
He even provides a jar which 'replaces' (builds on) Saxon
to do just what you want.

A very useful addition to Saxon.

Dave





--
Dave Pawson
XSLT XSL-FO FAQ.
Docbook FAQ.
http://www.dpawson.co.uk
