Parsing non-well-formed XML from within an XSLT


Eliot Kimber
I have an XSLT that parses HTML documents as part of a larger XSLT process (text descriptions for graphics referenced by the incoming XML). Some of these HTML files are not well-formed and so don't parse. They are valid HTML but usually have mismatched case for the start and end tags (<B>...</b>).

I'm pretty sure the answer is "no", but does Saxon provide a way to parse non-well-formed HTML as HTML? I was looking for something like parseHtml() but didn't see it in the Saxon extensions.

If this was for the main input document I'd know that I have to configure my own XML Reader. But since in this case I'm going through document(), I'm not sure what I could do.

For this data it'll probably be easier to just fix the files, but I wanted to make sure there wasn't some simple Saxon-provided solution I was overlooking.

Thanks,

Eliot
----
Eliot Kimber
Austin, TX

------------------------------------------------------------------------------
Dive into the World of Parallel Programming. The Go Parallel Website,
sponsored by Intel and developed in partnership with Slashdot Media, is your
hub for all things parallel software development, from weekly thought
leadership blogs to news, videos, case studies, tutorials and more. Take a
look and join the conversation now. http://goparallel.sourceforge.net/
_______________________________________________
saxon-help mailing list archived at http://saxon.markmail.org/
[hidden email]
https://lists.sourceforge.net/lists/listinfo/saxon-help 

Re: Parsing non-well-formed XML from within an XSLT

Michael Kay
Your wish is my command...

http://www.saxonica.com/documentation/#!functions/saxon/parse-html

Michael Kay
Saxonica
+44 (0) 118 946 5893





Re: Parsing non-well-formed XML from within an XSLT

Eliot Kimber
Excellent. Unfortunately, the DITA OT version I'm using still ships with Saxon 9.1.*, but I think we can manage to upgrade it for this client.

Cheers,

Eliot
 
Eliot Kimber
Austin, TX



Re: Parsing non-well-formed XML from within an XSLT

Michael Kay
If you want a do-it-yourself solution, you can write a URIResolver that processes doc('*.html') by creating a SAXSource initialised using setXMLReader(htmlParser) where htmlParser can be TagSoup or validator.nu.

Michael Kay
Saxonica
+44 (0) 118 946 5893
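
The do-it-yourself route described in this message might look roughly like the following. This is a minimal sketch, not Saxon's own code: the class name `HtmlAwareResolver`, the extension check, and the `Supplier<XMLReader>` parameter are all assumptions made for illustration. It relies only on the standard JAXP `URIResolver` and `SAXSource` APIs, and returns `null` for anything that does not look like HTML so that the processor falls back to its default resolution.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.Locale;
import java.util.function.Supplier;
import javax.xml.transform.Source;
import javax.xml.transform.TransformerException;
import javax.xml.transform.URIResolver;
import javax.xml.transform.sax.SAXSource;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;

/**
 * Hypothetical resolver: sends *.html / *.htm URIs through a caller-supplied,
 * HTML-tolerant XMLReader and leaves every other URI to the processor.
 */
final class HtmlAwareResolver implements URIResolver {
    // Assumption: the caller decides which parser to use (TagSoup, validator.nu, ...).
    private final Supplier<XMLReader> htmlReaderFactory;

    HtmlAwareResolver(Supplier<XMLReader> htmlReaderFactory) {
        this.htmlReaderFactory = htmlReaderFactory;
    }

    @Override
    public Source resolve(String href, String base) throws TransformerException {
        try {
            // Resolve the (possibly relative) href against the base URI.
            URI uri = (base == null || base.isEmpty())
                    ? new URI(href)
                    : new URI(base).resolve(href);
            String path = uri.getPath(); // null for opaque URIs
            String lower = (path == null) ? "" : path.toLowerCase(Locale.ROOT);
            if (!(lower.endsWith(".html") || lower.endsWith(".htm"))) {
                return null; // not HTML: let the processor resolve it itself
            }
            // Wrap the document in a SAXSource that carries its own parser.
            SAXSource source = new SAXSource(new InputSource(uri.toString()));
            source.setXMLReader(htmlReaderFactory.get());
            return source;
        } catch (URISyntaxException e) {
            throw new TransformerException("Cannot resolve " + href, e);
        }
    }
}
```

Assuming the TagSoup jar is on the classpath, it could be registered with something like `transformer.setURIResolver(new HtmlAwareResolver(() -> new org.ccil.cowan.tagsoup.Parser()))` before running the transformation. A fresh reader is created per document here because SAX readers are not generally safe to share between concurrent parses.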





Re: Parsing non-well-formed XML from within an XSLT

Rob Koberg-2
You can also just use TagSoup as your parser. Something like:


  $ export CLASSPATH=$CLASSPATH:saxon9he.jar:tagsoup-1.2.1.jar
  $ java net.sf.saxon.Transform -o:teacher-guides.html \
      -s:assets-list.xml -xsl:well-former.xsl \
      -x:org.ccil.cowan.tagsoup.Parser
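
For anyone driving the transformation from Java rather than the command line, a rough programmatic counterpart is to hand the parser to the source document through a JAXP `SAXSource`. The sketch below is an illustration only and does not claim to reproduce everything the `-x` option does; the class and method names (`ReaderDrivenTransform`, `transform`) are hypothetical, and the stylesheet is passed as a string just to keep the example self-contained.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.sax.SAXSource;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;

/** Hypothetical helper: runs a stylesheet over input parsed by a chosen XMLReader. */
final class ReaderDrivenTransform {
    static String transform(XMLReader reader, InputSource input, String stylesheet)
            throws Exception {
        // Compile the stylesheet with whichever JAXP processor is configured
        // (Saxon if it is on the classpath and selected, otherwise the JDK default).
        Transformer transformer = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new StringReader(stylesheet)));
        StringWriter out = new StringWriter();
        // The SAXSource pairs the input with the parser that should read it.
        transformer.transform(new SAXSource(reader, input), new StreamResult(out));
        return out.toString();
    }
}
```

With TagSoup on the classpath, the `reader` argument would be `new org.ccil.cowan.tagsoup.Parser()`; any other `XMLReader` implementation can be substituted in the same position.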

On Sun, Feb 1, 2015 at 11:09 AM, Michael Kay <[hidden email]> wrote:

> If you want a do-it-yourself solution, you can write a URIResolver that
> processes doc('*.html') by creating a SAXSource initialised using
> setXMLReader(htmlParser) where htmlParser can be TagSoup or validator.nu.
>
> On Saturday, January 31, 2015 5:10 PM, Michael Kay <[hidden email]>
> wrote:
>
>
> Your wish is my command...
>
> http://www.saxonica.com/documentation/#!functions/saxon/parse-html

Re: Parsing non-well-formed XML from within an XSLT

Dave Pawson-2
Thanks Rob, minor variant on using tagsoup version of saxon, available
on his website.

regards




--
Dave Pawson
XSLT XSL-FO FAQ.
Docbook FAQ.
http://www.dpawson.co.uk
