WARNING: Lurking bug in normalization code


WARNING: Lurking bug in normalization code

Colin Paul Adams
It appears to me that the normalization code, which you have adapted
from the sample provided by the Unicode consortium, is incorrect.

There is an explicit assumption in the code that all composition takes
place solely within the BMP. This is not the case. There are some
characters in the supplementary planes that have canonical
decompositions (which in turn are within the same block) - the musical
notation characters at least.
I don't think there are any other code points that violate this
assumption, but I'm not certain.
If I'm right, then Saxon users are unlikely to suffer from this
problem, unless someone out there is using the musical notation
characters, and outputting in NFC or NFKC.
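
One way to find out which supplementary-plane code points have canonical decompositions is to scan them with the JDK's java.text.Normalizer. This is only a sketch: the class SupplementaryScan and its methods are hypothetical names, and the answer reflects the Unicode version of the JDK that runs it, not the tables Saxon ships.

```java
import java.text.Normalizer;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for exploring the question; not part of Saxon.
class SupplementaryScan {

    static String toStr(int cp) {
        return new String(Character.toChars(cp));
    }

    /** True if the NFD form of the code point differs from the code point itself. */
    static boolean hasCanonicalDecomposition(int cp) {
        String s = toStr(cp);
        return !Normalizer.normalize(s, Normalizer.Form.NFD).equals(s);
    }

    /** True if the code point decomposes and NFC puts it back together again. */
    static boolean recomposes(int cp) {
        String s = toStr(cp);
        String nfd = Normalizer.normalize(s, Normalizer.Form.NFD);
        return !nfd.equals(s)
                && Normalizer.normalize(nfd, Normalizer.Form.NFC).equals(s);
    }

    public static void main(String[] args) {
        int decomposable = 0;
        List<String> composing = new ArrayList<>();
        for (int cp = Character.MIN_SUPPLEMENTARY_CODE_POINT;
                 cp <= Character.MAX_CODE_POINT; cp++) {
            if (!Character.isDefined(cp)) {
                continue; // unassigned in this JDK's Unicode version
            }
            if (hasCanonicalDecomposition(cp)) {
                decomposable++;
                if (recomposes(cp)) {
                    composing.add(String.format("U+%04X", cp));
                }
            }
        }
        System.out.println("Supplementary code points with a canonical decomposition: "
                + decomposable);
        System.out.println("Of those, recomposed by NFC: " + composing);
    }
}
```

Only the second list matters for a composition table; code points that decompose but are never recomposed do not need a composition key at all.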

I've emailed the author of this code (Mark Davis, who is also the
author of UAX #15), alerting him to the problem.

Note that the file NormalizationTest.txt allows generation of a
comprehensive test suite for normalization software. It was because I
was doing this for my Eiffel implementation of Unicode normalization,
that I came across this error. I had seen the explicit assumption in
the code, and so coded it as a pre-condition for a routine to
calculate the key to the composition table.
It is possible to use Design-by-Contract in Java code, I
believe. There is something called JML (Java Modeling Language). I
don't know anything about it other than its existence though.
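
For illustration, the BMP-only assumption could be written down as a precondition roughly like this. It is a sketch, not Saxon's code: CompositionKey and compositionKey are hypothetical names, the //@ requires lines follow JML's annotation-comment style but have not been run through any JML tool, and the runtime check simply mirrors them so that plain Java enforces the same contract.

```java
// Hypothetical sketch of a contract on the key calculation; not Saxon's code.
class CompositionKey {

    /**
     * Packs two code points into a single int key.
     * Assumption: both arguments are BMP code points (0..0xFFFF).
     */
    //@ requires 0 <= first && first <= 0xFFFF;
    //@ requires 0 <= second && second <= 0xFFFF;
    static int compositionKey(int first, int second) {
        // Runtime mirror of the precondition above.
        if (first < 0 || first > 0xFFFF || second < 0 || second > 0xFFFF) {
            throw new IllegalArgumentException(String.format(
                    "precondition violated: U+%04X, U+%04X", first, second));
        }
        return (first << 16) | second;
    }

    public static void main(String[] args) {
        // 'c' followed by U+0327 COMBINING CEDILLA: both in the BMP.
        System.out.println(Integer.toHexString(compositionKey('c', 0x0327)));
        try {
            // The musical-symbol pair from the supplementary planes.
            compositionKey(0x1D157, 0x1D165);
        } catch (IllegalArgumentException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

With the assumption stated as a contract, a test suite that feeds in supplementary-plane input fails loudly instead of quietly computing a wrong key.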

This highlights the need for a good test suite. Unfortunately, it's
not possible to provide comprehensive test coverage for XSLT.
--
Colin Adams
Preston Lancashire


-------------------------------------------------------
SF.Net email is sponsored by:
Tame your development challenges with Apache's Geronimo App Server. Download
it for free - -and be entered to win a 42" plasma tv or your very own
Sony(tm)PSP.  Click here to play: http://sourceforge.net/geronimo.php
_______________________________________________
saxon-help mailing list
[hidden email]
https://lists.sourceforge.net/lists/listinfo/saxon-help

RE: WARNING: Lurking bug in normalization code

Michael Kay
In the Saxon version I did fix some related bugs in the normalization code
published by the Unicode consortium; I also reported the problems to the
originator.

It's possible, of course, that I didn't find all the problems, and if you
can identify specific test cases that Saxon gets wrong then I'll be glad to
hear about them. (Well, I won't open the champagne - but you know what I
mean!)

Michael Kay
http://www.saxonica.com/

 

> -----Original Message-----
> From: [hidden email]
> [mailto:[hidden email]] On Behalf Of
> Colin Paul Adams
> Sent: 04 November 2005 08:49
> To: [hidden email]
> Subject: [saxon] WARNING: Lurking bug in normalization code
>
> It appears to me that the normalization code, which you have adapted
> from the sample provided by the Unicode consortium, is incorrect.
>
> There is an explicit assumption in the code that all composition takes
> place solely within the BMP. This is not the case. There are some
> characters in the supplementary planes that have canonical
> decompositions (which in turn are within the same block) - the musical
> notation characters at least.
> I don't think there are any other code points that violate this
> assumption, but I'm not certain.
> If I'm right, then Saxon users are unlikely to suffer from this
> problem, unless someone out there is using the musical notation
> characters, and outputting in NFC or NFKC.
>
> I've emailed the author of this code (Mark Davis, who is also the
> author of UAX #15), alerting him to the problem.
>
> Note that the file NormalizationTest.txt allows generation of a
> comprehensive test suite for normalization software. It was because I
> was doing this for my Eiffel implementation of Unicode normalization,
> that I came across this error. I had seen the explicit assumption in
> the code, and so coded it as a pre-condition for a routine to
> calculate the key to the composition table.
> It is possible to use Design-by-Contract in Java code, I
> believe. There is something called JML (Java Modeling Language). I
> don't know anything about it other than its existence though.
>
> This highlights the need for a good test suite. Unfortunately, it's
> not possible to provide comprehensive test coverage for XSLT.
> --
> Colin Adams
> Preston Lancashire
>
>





Re: WARNING: Lurking bug in normalization code

Elliotte Harold
In reply to this post by Colin Paul Adams
Colin Paul Adams wrote:

> There is an explicit assumption in the code that all composition takes
> place solely within the BMP. This is not the case. There are some
> characters in the supplementary planes that have canonical
> decompositions (which in turn are within the same block) - the musical
> notation characters at least.
> I don't think there are any other code points that violate this
> assumption, but I'm not certain.
> If I'm right, then Saxon users are unlikely to suffer from this
> problem, unless someone out there is using the musical notation
> characters, and outputting in NFC or NFKC.

For my own use in my own normalization code in XOM, I'd be very
interested in any test cases you've come up with for this. My experience
has been that the Unicode consortium test cases are useful but
incomplete. It's relatively easy to pass them and still have significant
bugs in your implementation. At least my initial implementation had
significant bugs that the Unicode test cases did not catch. I would not
be surprised to discover that I still have some bugs in the handling of
characters from beyond the BMP.

There's also an issue here that Unicode normalization is generally done
against a specific version of the Unicode data, i.e. an implementation
that does correct normalization for 3.0 may not be correct for 4.0. A
correct 4.0 implementation will probably not be correct for 4.1. Hmm,
section 7.4.6 fn:normalize-unicode of the newly released CR for XPath
functions and operators does not seem to address this point. Is the
Unicode version specified anywhere else in the drafts? Or is there
anything else on point here? If not, then I think a formal comment to
[hidden email] might be called for asking them to clarify
this point, either by picking a version of Unicode which must be
supported or asking implementations to specify which version they
support or some such.

Perhaps this discussion could/should be moved to the Unicode mailing list?

> This highlights the need for a good test suite. Unfortunately, it's
> not possible to provide comprehensive test coverage for XSLT.

Really? Why? I don't see any reason this would be harder for XSLT than
Java or Python or anything else.

--
Elliotte Rusty Harold  [hidden email]
XML in a Nutshell 3rd Edition Just Published!
http://www.cafeconleche.org/books/xian3/
http://www.amazon.com/exec/obidos/ISBN=0596007647/cafeaulaitA/ref=nosim



Re: WARNING: Lurking bug in normalization code

Colin Paul Adams
In reply to this post by Michael Kay
>>>>> "Michael" == Michael Kay <[hidden email]> writes:

    Michael> It's possible, of course, that I didn't find all the
    Michael> problems, and if you can identify specific test cases
    Michael> that Saxon gets wrong then I'll be glad to hear about
    Michael> them. (Well, I won't open the champagne - but you know
    Michael> what I mean!)

Well, I haven't tested anything - I just looked at

  /**
    * Returns the composite of the two characters. If the two
    * characters don't combine, returns NOT_COMPOSITE.
    * Only has to worry about BMP characters, since those are the only ones that can ever compose.
    * @param   first   first character (e.g. 'c')
    * @param   second  second character (e.g. '¸' cedilla)
    * @return          composite (e.g. 'ç')
    */
    public char getPairwiseComposition(int first, int second) {
        if (first < 0 || first > 0x10FFFF || second < 0 || second > 0x10FFFF) return NOT_COMPOSITE;
        return (char)compose.get((first << 16) | second);
    }

and read the comment.

But I now see that the code actually allows any Unicode character, not
just BMP characters, so it looks like only the comment is wrong.
And I reckon the if statement might as well be removed.
--
Colin Adams
Preston Lancashire



RE: WARNING: Lurking bug in normalization code

Michael Kay
In reply to this post by Elliotte Harold
> There's also an issue here that Unicode normalization is generally done
> against a specific version of the Unicode data, i.e. an implementation
> that does correct normalization for 3.0 may not be correct for 4.0. A
> correct 4.0 implementation will probably not be correct for 4.1. Hmm,
> section 7.4.6 fn:normalize-unicode of the newly released CR for XPath
> functions and operators does not seem to address this point. Is the
> Unicode version specified anywhere else in the drafts? Or is there
> anything else on point here?

This is covered at

http://www.w3.org/TR/xpath-functions/#conformance

Michael Kay
http://www.saxonica.com/





RE: WARNING: Lurking bug in normalization code

Michael Kay
In reply to this post by Colin Paul Adams
>
> Well, I haven't tested anything - I just looked at
>
>   /**
>     * Returns the composite of the two characters. If the two
>     * characters don't combine, returns NOT_COMPOSITE.
>     * Only has to worry about BMP characters, since those are the only ones that can ever compose.
>     * @param   first   first character (e.g. 'c')
>     * @param   second  second character (e.g. '¸' cedilla)
>     * @return          composite (e.g. 'ç')
>     */
>     public char getPairwiseComposition(int first, int second) {
>         if (first < 0 || first > 0x10FFFF || second < 0 || second > 0x10FFFF) return NOT_COMPOSITE;
>         return (char)compose.get((first << 16) | second);
>     }
>
> and read the comment.
>
> But I now see that the code actually allows any Unicode character, not
> just BMP characters, so it looks like only the comment is wrong.

Actually, no - this line is distinctly iffy:

          return (char)compose.get((first << 16) | second);
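
To make the concern concrete, here is a small sketch of the key arithmetic alone. KeyCollision and key are hypothetical names standing in for the expression (first << 16) | second; the composition table is not modelled, and whether a caller can actually reach these inputs is not something this snippet shows.

```java
// Hypothetical demonstration of the key arithmetic; not Saxon's code.
class KeyCollision {

    /** Same packing as the expression under discussion. */
    static int key(int first, int second) {
        return (first << 16) | second;
    }

    public static void main(String[] args) {
        // 'A' + U+0300 (combining grave) versus 'A' + U+10300 (supplementary):
        // bit 16 of the second argument lands on a bit already set in 'A'.
        System.out.println(key(0x41, 0x0300) == key(0x41, 0x10300));

        // A supplementary first argument loses its top bit in the 32-bit shift.
        System.out.println(key(0x1D157, 0x1D165) == key(0xD157, 0xD165));

        // The (char) cast keeps only the low 16 bits of a result.
        System.out.println(Integer.toHexString((char) 0x1D15E));
    }
}
```

Both comparisons print true and the last line prints d15e, so the packing is only safe if both arguments, and the composite, are known to be in the BMP.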

Michael Kay





Re: WARNING: Lurking bug in normalization code

Colin Paul Adams
>>>>> "Michael" == Michael Kay <[hidden email]> writes:

    Michael> Actually, no - this line is distinctly iffy:

    Michael>           return (char)compose.get((first << 16) | second);

Yeah, I wondered about that - but not being all that familiar with
Java hash tables, I wasn't sure if the key needed to be unique or not.

Myself, I simply use a pair of integers as a key.
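
In Java, one way to get the same effect as a pair of integers is to pack both code points into a 64-bit long. The sketch below assumes a plain HashMap rather than whatever table class Saxon uses; PairKeyTable, its methods, and the -1 sentinel are all hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical composition table keyed on the full pair; not Saxon's code.
class PairKeyTable {

    /** Hypothetical sentinel meaning "this pair does not compose". */
    static final int NOT_COMPOSITE = -1;

    private final Map<Long, Integer> compose = new HashMap<>();

    /** Each code point gets its own 32-bit half, so no bits can overlap. */
    static long key(int first, int second) {
        return ((long) first << 32) | (second & 0xFFFFFFFFL);
    }

    void put(int first, int second, int composite) {
        compose.put(key(first, second), composite);
    }

    /** Returns an int so that a composite beyond the BMP would also fit. */
    int get(int first, int second) {
        Integer composite = compose.get(key(first, second));
        return composite == null ? NOT_COMPOSITE : composite;
    }

    public static void main(String[] args) {
        PairKeyTable table = new PairKeyTable();
        table.put('c', 0x0327, 0x00E7);   // c + combining cedilla -> c with cedilla
        table.put(0x41, 0x0300, 0x00C0);  // A + combining grave   -> A with grave
        System.out.println(Integer.toHexString(table.get(0x41, 0x0300)));
        System.out.println(table.get(0x41, 0x10300)); // distinct key, no entry
    }
}
```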

A simple test case is the following:

c1 = 1D15E
c2 = 1D157 1D165
c3 = 1D157 1D165
c4 = 1D157 1D165
c5 = 1D157 1D165

#    NFC
#      c2 ==  NFC(c1) ==  NFC(c2) ==  NFC(c3)
#      c4 ==  NFC(c4) ==  NFC(c5)
#
#    NFD
#      c3 ==  NFD(c1) ==  NFD(c2) ==  NFD(c3)
#      c5 ==  NFD(c4) ==  NFD(c5)
#
#    NFKC
#      c4 == NFKC(c1) == NFKC(c2) == NFKC(c3) == NFKC(c4) == NFKC(c5)
#
#    NFKD
#      c5 == NFKD(c1) == NFKD(c2) == NFKD(c3) == NFKD(c4) == NFKD(c5)
#
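
The invariants above can be checked mechanically. The sketch below runs them against the JDK's java.text.Normalizer, which is an assumption made purely for illustration; to test Saxon you would substitute its own normalizer, whose entry point is not shown in this thread. ConformanceLine, parse and check are hypothetical names.

```java
import java.text.Normalizer;
import java.text.Normalizer.Form;

// Hypothetical checker for one five-column test case; not Saxon's code.
class ConformanceLine {

    /** Turns a column such as "1D157 1D165" into a Java string. */
    static String parse(String hex) {
        StringBuilder sb = new StringBuilder();
        for (String h : hex.trim().split("\\s+")) {
            sb.appendCodePoint(Integer.parseInt(h, 16));
        }
        return sb.toString();
    }

    /** Checks the NFC, NFD, NFKC and NFKD invariants for columns c1..c5. */
    static boolean check(String... columns) {
        if (columns.length != 5) {
            throw new IllegalArgumentException("expected five columns c1..c5");
        }
        String[] c = new String[5];
        for (int i = 0; i < 5; i++) {
            c[i] = parse(columns[i]);
        }
        boolean ok = true;
        for (int i = 0; i < 5; i++) {
            // c1..c3 normalize to c2 (NFC) and c3 (NFD); c4..c5 to c4 and c5.
            String expectedNfc = i < 3 ? c[1] : c[3];
            String expectedNfd = i < 3 ? c[2] : c[4];
            ok &= expectedNfc.equals(Normalizer.normalize(c[i], Form.NFC));
            ok &= expectedNfd.equals(Normalizer.normalize(c[i], Form.NFD));
            // Every column normalizes to c4 under NFKC and to c5 under NFKD.
            ok &= c[3].equals(Normalizer.normalize(c[i], Form.NFKC));
            ok &= c[4].equals(Normalizer.normalize(c[i], Form.NFKD));
        }
        return ok;
    }

    public static void main(String[] args) {
        System.out.println(check("1D15E", "1D157 1D165", "1D157 1D165",
                                 "1D157 1D165", "1D157 1D165"));
    }
}
```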
 
--
Colin Adams
Preston Lancashire



Re: WARNING: Lurking bug in normalization code

Colin Paul Adams
In reply to this post by Michael Kay
>>>>> "Michael" == Michael Kay <[hidden email]> writes:

    Michael> Actually, no - this line is distinctly iffy:

    Michael>           return (char)compose.get((first << 16) | second);

Actually, it's not!

I got this reply from Mark Davis:

>The compositions are limited to those available in Unicode 3.0, thus
>excluding any non-BMP characters.

>See http://www.unicode.org/reports/tr15/#Primary_Exclusion_List_Table

--
Colin Adams
Preston Lancashire

