Localization and the Comparer class

Hi!

Below is a simple program that uses the Comparer class to compare two 
strings named str1 and str2.
If I pass 0x040A as the first argument to the CultureInfo constructor, I get 
the traditional sort order according to the MSDN documentation that you can 
find at the bottom.
The WriteLine statement in the program writes 1, meaning that str1 > str2.
Can somebody explain how this works, given that the comparison is not based 
on the ASCII table?
I mean, if we used the normal ASCII table we would say that str1 < str2 
because the letter l is less than u.

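using System;
using System.Collections;
using System.Globalization;

public class Program
{
    public static void Main()
    {
        // Creates the strings to compare.
        String str1 = "llegar";
        String str2 = "lugar";

        // 0x040A is the culture identifier for Spanish (Spain), traditional sort order.
        Comparer myCompTrad = new Comparer(new CultureInfo(0x040A, false));
        Console.WriteLine("   Traditional Sort  : {0}",
            myCompTrad.Compare(str1, str2));
    }
}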

The Spanish (Spain) culture uses two culture identifiers, 0x0C0A using the 
default international sort order, and 0x040A using the traditional sort 
order. If the CultureInfo is constructed using the es-ES culture name, the 
new CultureInfo uses the default international sort order. For the 
traditional sort order, the object is constructed using the name 
es-ES_tradnl.

//Tony 


Reply Tony 2/9/2010 11:21:09 PM

Tony Johansson wrote:
> Hi!
> 
> Below is a simple program that is using the Comparer class to compare two 
> strings named str1 and str2.
> If I use the 0x040A as the first argument to the CultureInfo I use the 
> traditional sort order according to the MSDN documentation that you can find 
> at the bottom.

At the bottom of what?

> The WriteLine statement in the program is writing 1 as the value meaning 
> that str1 > str2.
> Can somebody explain how this works because the comparing is not based on 
> the ascii table  ?

What do you want to know?  If you want all the gory details of the 
comparison, you need to just look at the implementation (which may or 
may not involve diving into the unmanaged Windows API).

The basic answer is: duh, of course a culture-specific comparison must 
not be based on the ASCII character values.  That's the whole point of a 
culture-specific comparison, as ASCII is itself not a culturally-based 
character encoding.

Instead, when you do a culture-specific comparison, it uses whatever 
ordering rules exist for that specific culture.  Humans being the kind 
of animal they are, these rules aren't always logical.  Even when they 
are logical, the logic does not necessarily follow the representation of 
characters and words as found in a computer.

But, those rules _are_ what a human being expects when the computer is 
asked to order the input, which is the whole reason for having 
culture-specific support in various APIs, including .NET.
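
The contrast between an ordinal comparison and a culture-specific one can be sketched roughly as follows. This is only an illustration, not the framework's internals: the class and method names (CompareSketch, Ordinal, Cultural) are made up for the example, and the two culture-specific results depend on the collation data available to the runtime and operating system, so no output is claimed for them here. The original poster reports 1 for the traditional comparer.

```csharp
using System;
using System.Globalization;

// Hypothetical helper class; the names are illustrative only.
public static class CompareSketch
{
    // Ordinal comparison: compares raw UTF-16 code units,
    // which matches the "ASCII table" view for these two words.
    public static int Ordinal(string a, string b) =>
        Math.Sign(string.CompareOrdinal(a, b));

    // Culture-specific comparison: uses the sorting rules of the given culture.
    public static int Cultural(CultureInfo culture, string a, string b) =>
        Math.Sign(culture.CompareInfo.Compare(a, b));

    public static void Main()
    {
        string str1 = "llegar", str2 = "lugar";

        // 'l' has a lower code than 'u', so this is negative.
        Console.WriteLine("Ordinal       : {0}", Ordinal(str1, str2));

        // 0x0C0A: Spanish (Spain), international sort order.
        // 0x040A: Spanish (Spain), traditional sort order.
        // Assumption: the runtime has culture data for both identifiers.
        Console.WriteLine("International : {0}",
            Cultural(new CultureInfo(0x0C0A, false), str1, str2));
        Console.WriteLine("Traditional   : {0}",
            Cultural(new CultureInfo(0x040A, false), str1, str2));
    }
}
```

Math.Sign is used only to normalize the results, since Compare is documented to return a negative, zero, or positive value rather than exactly -1, 0, or 1.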

> I mean if we use the normal ascii table we would have said that str1 < str2 
> because the letter l is less than u.

The 0x040A LCID is not even listed on the reference that I looked at 
(http://msdn.microsoft.com/en-us/goglobal/bb896001.aspx).  But, we can 
see on the documentation for the CultureInfo class that it's used to 
indicate a "traditional" Spanish-specific sorting.

And for whatever reason (I don't speak Spanish, so I couldn't tell you 
why), the word "llegar" is alphabetized after "lugar".  So that's what 
the Compare() method tells you when you compare them.

If you want to know why in the "traditional" ordering, "llegar" comes 
after "lugar", but in the "international" ordering, it comes before, you 
need to ask someone who knows about Spanish culture.  It's not a 
programming question.

Pete
Reply Peter 2/10/2010 12:13:47 AM


Peter Duniho wrote:
> If you want to know why in the "traditional" ordering, "llegar" comes 
> after "lugar", but in the "international" ordering, it comes before, you 
> need to ask someone who knows about Spanish culture.  It's not a 
> programming question.

The Spanish alphabet is, officially, a, b, c, ch, d, e, f, g, h, i, j, 
k, l, ll, m, n, ñ, o, p, q, r, s, t, u, v, w, x, y, z. The digraph "ll", 
which has its own pronunciation distinct from that of "l", has been 
treated as a single letter, in the same way as the digraph "ch".

However, a 1994 international language reform passed during the Tenth 
Congress of the Association of Spanish Language Academies decreed that 
henceforth, for purposes of sorting, "ch" and "ll" should be treated as 
two separate letters, so the official order would now be {llegar, 
lugar}, despite the fact that "llegar" is still officially considered to 
consist of five letters. Weird, but official, and perhaps enacted in 
order to avoid the kinds of problems involved in international, 
computerized data exchange, given that everybody, Spanish speakers 
included, *types* "ch" and "ll" each as a sequence of two letters 
instead of as a digraph.
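
To make the traditional rule concrete, here is a minimal hand-rolled sketch that treats "ch" and "ll" as single letters when comparing. It is not how .NET or Windows implements the es-ES_tradnl sort; the class TraditionalSpanishSketch and its methods are invented for illustration, and it assumes lowercase input without accented vowels.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical illustration of the traditional ordering; not the framework's algorithm.
public static class TraditionalSpanishSketch
{
    // Letters in the traditional order listed above; "ch" and "ll" each count as one letter.
    private static readonly string[] Alphabet =
    {
        "a", "b", "c", "ch", "d", "e", "f", "g", "h", "i", "j", "k", "l", "ll",
        "m", "n", "ñ", "o", "p", "q", "r", "s", "t", "u", "v", "w", "x", "y", "z"
    };

    // Splits a lowercase word into alphabet positions, consuming "ch" and "ll" as single units.
    // Characters outside the alphabet map to -1 and therefore sort first.
    public static List<int> ToKeys(string word)
    {
        var keys = new List<int>();
        int i = 0;
        while (i < word.Length)
        {
            int len = 1;
            if (i + 1 < word.Length)
            {
                string pair = word.Substring(i, 2);
                if (pair == "ch" || pair == "ll") len = 2;
            }
            keys.Add(Array.IndexOf(Alphabet, word.Substring(i, len)));
            i += len;
        }
        return keys;
    }

    // Compares two words letter by letter using the traditional alphabet positions.
    public static int Compare(string a, string b)
    {
        List<int> ka = ToKeys(a), kb = ToKeys(b);
        int n = Math.Min(ka.Count, kb.Count);
        for (int i = 0; i < n; i++)
        {
            if (ka[i] != kb[i]) return ka[i] < kb[i] ? -1 : 1;
        }
        // A word that is a prefix of the other sorts first.
        return ka.Count.CompareTo(kb.Count);
    }

    public static void Main()
    {
        // "ll" sorts after "l", so llegar comes after lugar under this rule.
        Console.WriteLine(Compare("llegar", "lugar"));                  // prints 1
        // Letter by letter, the second 'l' sorts before 'u'.
        Console.WriteLine(string.CompareOrdinal("llegar", "lugar") < 0); // prints True
    }
}
```

Under this sketch the first letter of "llegar" is "ll", which follows "l" in the traditional alphabet, so the word lands after every word that starts with a plain "l", including "lugar".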
Reply Harlan 2/15/2010 4:25:54 PM
