2016-08-30, 17:39   #1131
Mark Rose
Jan 2013

3²·11·29 Posts

Originally Posted by Madpoo
With SQL collations it's more about the sorting, or even how certain umlauts are handled (it surprised me to see "UE" treated as equal to "Ü").

And when sorting, should "Éclair" come before/after "eclair", or, with binary collation, will it show up after "zebra"?

Curiously, even for the same language, different locales may opt to sort accented characters differently. I'm trying to remember the example... I don't know if it was a difference in fr-FR and fr-CA, or maybe pt-PT and pt-BR. Whatever the case... languages are funny things.

Those account for the accent-sensitive and case-sensitive options in collations, but then with other charsets like Cyrillic and Polish (not to mention Chinese, Japanese and Korean...CJK) you have to pay even more attention, and when comparing across the two, find some common collation (like binary, perhaps) where both character sets have a place to live.

Before my current job I never thought a single moment about any of this... SQL collations, ASCII folding in search indices, German decompounding when indexing/searching, the double-wide western characters in CJK (or the frustrating search for a decent font that can show all of the common Unicode characters, monospaced. One that has Japanese *and* Korean, and won't show the Korean characters sideways as many of the freebies would do).

We haven't expanded to Turkey, Greece, any of the Arab countries or Israel. I imagine the fun we'll have if/when we do our first right-to-left (RTL) language and how that would impact our entire design... LOL

Anyway, I don't blame anyone for making the DB columns varchar like they are now... until it's a problem, you don't really know how interesting that makes things.
Yikes. I'm glad I don't have to deal with all that.

But I have seen collation errors at the library. In French, é is sorted with e, as it's an accented letter. But in Swedish, for instance, å is the 27th letter of the alphabet (which ends in xyzåäö) and not an accented letter. The poor librarians collated å, ä, and ö with a and o. Interestingly, Swedish does have accented letters, mainly in French and German loan words. So you'll see é and ü occasionally, and some others, but they're not separate letters. Fun times.
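
To make the librarians' mistake concrete, here is a rough Python sketch. It is not a real collation implementation (something like ICU would be the proper tool); the word list, the `SWEDISH_ALPHABET` string and both key functions are made up for illustration, and folding every other accented letter onto its base letter is a simplifying assumption.

```python
import unicodedata

# Assumption for illustration: the Swedish alphabet as a plain ordered string.
SWEDISH_ALPHABET = "abcdefghijklmnopqrstuvwxyzåäö"


def strip_accents(text):
    """Drop combining marks, so 'é' -> 'e' and 'å' -> 'a'."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def accent_folding_key(word):
    """What the librarians did: treat every accented letter as its base letter."""
    return strip_accents(word.lower())


def swedish_key(word):
    """Treat å, ä, ö as letters 27-29; fold other accents (é, ü, ...) to the base."""
    key = []
    for ch in unicodedata.normalize("NFC", word.lower()):
        if ch in SWEDISH_ALPHABET:
            key.append(SWEDISH_ALPHABET.index(ch))
            continue
        base = strip_accents(ch)
        if len(base) == 1 and base in SWEDISH_ALPHABET:
            key.append(SWEDISH_ALPHABET.index(base))
        else:
            # Anything unknown sorts after the alphabet.
            key.append(len(SWEDISH_ALPHABET))
    return key


words = ["öl", "zebra", "ärta", "apa", "ål", "orm"]
folded_order = sorted(words, key=accent_folding_key)
swedish_order = sorted(words, key=swedish_key)
print(folded_order)   # å, ä, ö filed among a and o
print(swedish_order)  # å, ä, ö filed after z
```

With the folding key, "ål" lands in front of "apa"; with the Swedish key it comes after "zebra", which is where a Swedish reader would look for it.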

French has messed-up collation rules. When two words differ only in their accents, the accents are compared right-to-left, starting from the end of the word.
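
A small Python sketch of that right-to-left rule, using the usual cote/côte/coté/côté example. The helper names are hypothetical, and using the combining mark's code point as the accent weight is an arbitrary assumption that happens to be enough for this example; a real collator uses proper weight tables.

```python
import unicodedata


def split_letters_and_accents(word):
    """Return (base letters, per-letter accent weights) for a word."""
    letters = []
    accents = []
    for ch in unicodedata.normalize("NFC", word.lower()):
        decomposed = unicodedata.normalize("NFD", ch)
        letters.append(decomposed[0])
        # Weight 0 means unaccented; otherwise use the first combining
        # mark's code point (an arbitrary stand-in for a real weight).
        accents.append(ord(decomposed[1]) if len(decomposed) > 1 else 0)
    return "".join(letters), accents


def forward_key(word):
    """Base letters first, then accents compared left-to-right."""
    letters, accents = split_letters_and_accents(word)
    return (letters, accents)


def french_key(word):
    """Base letters first, then accents compared right-to-left."""
    letters, accents = split_letters_and_accents(word)
    return (letters, accents[::-1])


words = ["côté", "coté", "côte", "cote"]
forward_order = sorted(words, key=forward_key)
french_order = sorted(words, key=french_key)
print(forward_order)  # ['cote', 'coté', 'côte', 'côté']
print(french_order)   # ['cote', 'côte', 'coté', 'côté']
```

The base letters are identical in all four words, so only the accents decide the order. Comparing them from the end is what moves "côte" ahead of "coté".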