2016-08-30, 15:15   #1128
Serpentine Vermin Jar
Madpoo
Jul 2014

29·113 Posts

Originally Posted by Mark Rose
Indeed. Binary and UTF-8 are the only character sets that should ever be used in a modern system.
With SQL collations it's more about the sorting, or even how certain umlauts are handled (it surprised me to see "UE" treated as equal to "Ü").

And when sorting, should "Éclair" come before/after "eclair", or, with binary collation, will it show up after "zebra"?
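To make the sorting question concrete, here's a rough Python sketch. It isn't what any particular database does internally: plain code-point order stands in for a binary collation, and `fold_key` is a made-up helper that approximates an accent-insensitive, case-insensitive collation by stripping combining marks and case.

```python
import unicodedata

# "\u00c9" is the precomposed "É", written as an escape so the
# example doesn't depend on how this post's text was normalized.
words = ["zebra", "\u00c9clair", "eclair", "apple"]

# Code-point order, roughly what a binary collation gives you.
# "É" is U+00C9, which is greater than "z" (U+007A).
binary_order = sorted(words)

def fold_key(s):
    """Hypothetical accent- and case-insensitive sort key."""
    decomposed = unicodedata.normalize("NFD", s)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.casefold()

# Python's sort is stable, so words with equal keys keep their input order.
folded_order = sorted(words, key=fold_key)

print(binary_order)  # ['apple', 'eclair', 'zebra', 'Éclair']
print(folded_order)  # ['apple', 'Éclair', 'eclair', 'zebra']
```

So yes, with code-point ordering "Éclair" lands after "zebra". With the folded key, "Éclair" and "eclair" compare equal, and which one comes first is just a tie-break (here, input order). A real collation would have an explicit rule for that tie.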

Curiously, even for the same language, different locales may opt to sort accented characters differently. I'm trying to remember the example... I don't know if it was a difference in fr-FR and fr-CA, or maybe pt-PT and pt-BR. Whatever the case... languages are funny things.

Those account for the accent-sensitive and case-sensitive options in collations, but then with other alphabets and scripts like Polish and Cyrillic (not to mention Chinese, Japanese and Korean... CJK) you have to pay even more attention, and when comparing across two of them, you have to find some common collation (like binary, perhaps) where both character sets have a place to live.

Before my current job I never gave a single moment's thought to any of this... SQL collations, ASCII folding in search indices, German decompounding when indexing/searching, the double-wide (full-width) western characters in CJK text (or the frustrating search for a decent monospaced font that can show all of the common Unicode characters. One that has Japanese *and* Korean, and won't show the Korean characters sideways as many of the freebies would do).
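For anyone who hasn't run into ASCII folding or the double-wide characters, here's a small Python sketch using only the standard `unicodedata` module. `ascii_fold` is a made-up helper name, and it only covers characters that decompose into a base letter plus combining marks. Real search analyzers use much bigger mapping tables (this won't turn "ß" into "ss", for example).

```python
import unicodedata

def ascii_fold(text):
    """Hypothetical ASCII-folding helper: drop accents, keep base letters."""
    # NFKD splits "é" into "e" + a combining acute accent, and also maps
    # full-width Latin letters back to their ordinary forms.
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

# "\u00e9" is "é"; "\uff21\uff22\uff23" is full-width "ABC".
print(ascii_fold("caf\u00e9"))                              # cafe
print(unicodedata.east_asian_width("\uff21"))               # F  (full-width)
print(unicodedata.east_asian_width("A"))                    # Na (narrow)
print(unicodedata.normalize("NFKC", "\uff21\uff22\uff23"))  # ABC
```

The full-width "Ａ" and the ordinary "A" are different code points, so a binary comparison won't match them unless something like NFKC normalization is applied first.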

We haven't expanded to Turkey, Greece, any of the Arab countries or Israel. I imagine the fun we'll have if/when we do our first right-to-left (RTL) language and how that would impact our entire design... LOL

Anyway, I don't blame anyone for making the DB columns varchar like they are now... until it's a problem, you don't really know how interesting that makes things.