Override specified encoding

Andrew Welch
I'm currently batch transforming XML files with Saxon 6.5.4. The
files specify UTF-8 as the encoding in the XML declaration but contain the
byte 0x92, which, according to another post from Mike, suggests
the real encoding is Windows-1252.

If I manually change the declared encoding to Windows-1252 everything is OK;
otherwise the parser halts with a fatal error.

-Is there any way using Saxon 6.5.4 to override the encoding in the
XML declaration with one from the command line? If not, is it possible
from Java?

-Is it possible to recover from the fatal error and continue parsing,
ignoring the character?  That would be fine for the time being.

I'm guessing that the files have been edited in something like
Notepad... I would've thought this kind of thing was quite common but
there isn't that much information out there on it.

I did use UltraEdit to do search-and-replace across the directory, but
that resolved ampersands, apostrophes, etc. Any suggestions for an
XML-aware tool that can do that kind of thing would be useful as
well...

thanks,
andrew


-------------------------------------------------------
SF.Net email is Sponsored by the Better Software Conference & EXPO
September 19-22, 2005 * San Francisco, CA * Development Lifecycle Practices
Agile & Plan-Driven Development * Managing Projects & Teams * Testing & QA
Security * Process Improvement & Measurement * http://www.sqe.com/bsce5sf
_______________________________________________
saxon-help mailing list
[hidden email]
https://lists.sourceforge.net/lists/listinfo/saxon-help

RE: Override specified encoding

Michael Kay
You could probably do something like this:

// Decode the bytes as cp1252 ourselves, before the XML parser sees them
InputStreamReader isr =
    new InputStreamReader(new FileInputStream(filename), "cp1252");
InputSource is = new InputSource(isr);
SAXSource ss = new SAXSource(is);

Chances are, if you supply a Reader as the input to the XML parser, it will
ignore the encoding declaration contained in the file. Your
InputStreamReader doesn't know anything about XML, so it's going to treat
the file as cp1252 whatever the XML declaration says.
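
For anyone wanting to try the Reader approach end to end, here is a minimal sketch. It makes some assumptions that are not from the thread: it uses the JDK's built-in JAXP identity transformer rather than Saxon 6.5.4 with a real stylesheet, it reads from an in-memory sample instead of a file, and the class and method names (EncodingOverride, sourceWithEncoding, rootText, sample) are made up for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMResult;
import javax.xml.transform.sax.SAXSource;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical helper class, not part of Saxon or JAXP.
final class EncodingOverride {

    /** Decodes the bytes with the given charset and hands the parser a character stream. */
    static SAXSource sourceWithEncoding(InputStream bytes, String charsetName) {
        Reader reader = new InputStreamReader(bytes, Charset.forName(charsetName));
        return new SAXSource(new InputSource(reader));
    }

    /** Runs an identity transformation into a DOM and returns the root element's text. */
    static String rootText(SAXSource source) throws TransformerException {
        Transformer identity = TransformerFactory.newInstance().newTransformer();
        DOMResult result = new DOMResult();
        identity.transform(source, result);
        return ((Document) result.getNode()).getDocumentElement().getTextContent();
    }

    /** Sample document: declared as UTF-8, but containing the single byte 0x92. */
    static byte[] sample() {
        byte[] head = "<?xml version=\"1.0\" encoding=\"UTF-8\"?><a>it"
                .getBytes(StandardCharsets.US_ASCII);
        byte[] tail = "s</a>".getBytes(StandardCharsets.US_ASCII);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        out.write(head, 0, head.length);
        out.write(0x92);
        out.write(tail, 0, tail.length);
        return out.toByteArray();
    }

    public static void main(String[] args) throws Exception {
        SAXSource source =
                sourceWithEncoding(new ByteArrayInputStream(sample()), "windows-1252");
        // The 0x92 byte should arrive as U+2019 (right single quotation mark).
        System.out.println(rootText(source));
    }
}
```

With a real file you would pass a FileInputStream instead of the in-memory sample. If the documents use relative references (external entities, for example), it is probably worth calling setSystemId on the InputSource as well, since a Reader on its own carries no base URI.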

Generally, you need to fix this using tools that are not XML-aware, as any
XML-aware tool is going to tell you that you've got bad data.

If you don't want to write Java, you could transcode the files into UTF-8
using a command-line transcoder such as those at
http://xml.ascc.net/en/utf-8/gluesoft.html or
http://xml.ascc.net/en/utf-8/transcode-index.html - I don't have experience
of these tools.
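
If the command-line transcoders are not an option, the same transcoding step can be sketched in a few lines of Java. This is only an illustration under stated assumptions: it needs a JDK with java.nio.file (Java 7 or later, so newer than what was current when this thread was written), it reads each file fully into memory, and the class and method names (TranscodeToUtf8, transcode) are hypothetical.

```java
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper class for illustration only.
final class TranscodeToUtf8 {

    /** Reads 'in' as text in its actual encoding and writes it to 'out' as UTF-8. */
    static void transcode(Path in, Path out, Charset actual) throws IOException {
        // Not XML-aware: the bytes are decoded without looking at the XML declaration.
        String text = new String(Files.readAllBytes(in), actual);
        Files.write(out, text.getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 2) {
            System.err.println("usage: java TranscodeToUtf8 <input> <output>");
            return;
        }
        transcode(Paths.get(args[0]), Paths.get(args[1]), Charset.forName("windows-1252"));
    }
}
```

Because the files in question already declare UTF-8, the declaration should not need editing after this step; the bytes simply end up matching it. Bytes that have no mapping in Windows-1252 would come out as U+FFFD, so it may be worth checking the output for that character.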


Michael Kay
http://www.saxonica.com/



> -----Original Message-----
> From: [hidden email]
> [mailto:[hidden email]] On Behalf Of
> andrew welch
> Sent: 06 September 2005 10:05
> To: [hidden email]
> Subject: [saxon] Override specified encoding

Re: Override specified encoding

Andrew Welch
On 9/6/05, Michael Kay <[hidden email]> wrote:

> You could probably do something like this:
>
> InputStreamReader isr = new InputStreamReader(new FileInputStream(filename),
> "cp1252");
> InputSource is = new InputSource(isr);
> SAXSource ss = new SAXSource(is);
>
> Chances are, if you supply a Reader as the input to the XML parser, it will
> ignore the encoding declaration contained in the file. Your
> InputStreamReader doesn't know anything about XML, so it's going to treat
> the file as cp1252 whatever the XML declaration says.

Thanks, in a quick test this seems to do the trick.


> Generally, you need to fix this using tools that are not XML-aware, as any
> XML-aware tool is going to tell you that you've got bad data.

That's a very valid point :)

> If you don't want to write Java, you could transcode the files into UTF-8
> using a command-line transcoder such as those at
> http://xml.ascc.net/en/utf-8/gluesoft.html or
> http://xml.ascc.net/en/utf-8/transcode-index.html - I don't have experience
> of these tools.

They seem to be Unix-only for cp1252, so I'll go the Java route.
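
For a batch where only some files are mislabelled, one possible refinement of the Java route is to check each file for well-formed UTF-8 first and fall back to Windows-1252 only when that check fails. The sketch below is an assumption layered on top of the thread, not something suggested in it, and the names (EncodingGuess, isValidUtf8, chooseCharset) are hypothetical.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class for illustration only.
final class EncodingGuess {

    /** Returns true if the bytes decode as UTF-8 without any malformed sequence. */
    static boolean isValidUtf8(byte[] data) {
        CharsetDecoder strict = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            strict.decode(ByteBuffer.wrap(data));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    /** Assumption: anything that is not valid UTF-8 is treated as Windows-1252. */
    static Charset chooseCharset(byte[] data) {
        return isValidUtf8(data) ? StandardCharsets.UTF_8 : Charset.forName("windows-1252");
    }

    public static void main(String[] args) {
        byte[] mislabelled = { 'i', 't', (byte) 0x92, 's' };
        System.out.println(chooseCharset(mislabelled));
    }
}
```

The chosen charset could then be passed to the InputStreamReader shown earlier in the thread. A lone 0x92 is never valid UTF-8, so files like the ones described here would take the fallback, but the fallback itself is a guess: a file that fails the check might be in some other encoding entirely.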

