mersenneforum.org  

Go Back   mersenneforum.org > Great Internet Mersenne Prime Search > PrimeNet

Reply
 
Thread Tools
Old 2016-08-28, 05:29   #1123
Madpoo
Serpentine Vermin Jar
 
Madpoo's Avatar
 
Jul 2014

29·113 Posts
Default

Quote:
Originally Posted by retina View Post
I don't understand the need for 16-bit fields? UTF8 only needs bytes. Browsers submit UTF8 bytes, things are compared as UTF8 bytes, responses are delivered in UTF8 byes. There is no need to convert at any point.
It's a SQL thing... it's about storing in Unicode specifically, not just UTF-8.

I don't know about other SQL servers, but MSSQL stores things in UCS-2, not UTF-8, so if you want more character support you're looking at nvarchar. I guess it could be varbinary too if you didn't mind making the app figure it out that way, but whatever.

Where I always get into trouble with nvarchar is when doing things with static strings, I always seem to forget to use N'string goes here' and use just 'string goes here' without the N, but that's typically also when there's a cast to varbinary in the mix as well.

e.g. cast('string' to varbinary(30)) != cast(N'string' to varbinary(30))
Madpoo is offline   Reply With Quote
Old 2016-08-28, 19:05   #1124
retina
Undefined
 
retina's Avatar
 
"The unspeakable one"
Jun 2006
My evil lair

22×1,447 Posts
Default

Quote:
Originally Posted by Madpoo View Post
It's a SQL thing... it's about storing in Unicode specifically, not just UTF-8
Actually I think it is a sorting thing. They only purpose I know of for informing the DB that some data are in specific text formats is to sort things "logically" so that letters with accents will sort close to the same letter without an accent, and other similar lexicographic reasons. For all other purposes, like display and comparison for example, simple byte data is just fine.
retina is online now   Reply With Quote
Old 2016-08-28, 20:20   #1125
Mark Rose
 
Mark Rose's Avatar
 
"/X\(‘-‘)/X\"
Jan 2013

B3716 Posts
Default

Quote:
Originally Posted by retina View Post
Actually I think it is a sorting thing. They only purpose I know of for informing the DB that some data are in specific text formats is to sort things "logically" so that letters with accents will sort close to the same letter without an accent, and other similar lexicographic reasons. For all other purposes, like display and comparison for example, simple byte data is just fine.
It also matters for LIKE, STRCMP, etc.
Mark Rose is offline   Reply With Quote
Old 2016-08-29, 21:36   #1126
Madpoo
Serpentine Vermin Jar
 
Madpoo's Avatar
 
Jul 2014

29×113 Posts
Default

Quote:
Originally Posted by retina View Post
Actually I think it is a sorting thing. They only purpose I know of for informing the DB that some data are in specific text formats is to sort things "logically" so that letters with accents will sort close to the same letter without an accent, and other similar lexicographic reasons. For all other purposes, like display and comparison for example, simple byte data is just fine.
There are the collations which also play a factor (how things are sorted, case/accent sensitivity, etc.)

My day job may involve working with data in Cyrillic, Polish, CJK, etc. and those DB's in particular have their collation set to an appropriate one. I have a lot of "fun" when data is being transferred across DBs and matching up data, since I have to add some type of COLLATE (kind of like a CAST) to avoid the mismatching collation errors that inevitably arise.

I once knew a fella who insisted all DB's use binary collation... at the time we all thought he was pretty out there for that idea and we actually had to do a project to switch to SQL_Latin1... but when dealing with multiple languages and character sets, it's actually a better argument for binary if you don't mind specifying what collation you want to have your rows sorted, because binary just won't cut it.
Madpoo is offline   Reply With Quote
Old 2016-08-30, 01:58   #1127
Mark Rose
 
Mark Rose's Avatar
 
"/X\(‘-‘)/X\"
Jan 2013

32×11×29 Posts
Default

Quote:
Originally Posted by Madpoo View Post
There are the collations which also play a factor (how things are sorted, case/accent sensitivity, etc.)

My day job may involve working with data in Cyrillic, Polish, CJK, etc. and those DB's in particular have their collation set to an appropriate one. I have a lot of "fun" when data is being transferred across DBs and matching up data, since I have to add some type of COLLATE (kind of like a CAST) to avoid the mismatching collation errors that inevitably arise.

I once knew a fella who insisted all DB's use binary collation... at the time we all thought he was pretty out there for that idea and we actually had to do a project to switch to SQL_Latin1... but when dealing with multiple languages and character sets, it's actually a better argument for binary if you don't mind specifying what collation you want to have your rows sorted, because binary just won't cut it.
Indeed. Binary and UTF-8 are the only character sets that should ever be used in a modern system.
Mark Rose is offline   Reply With Quote
Old 2016-08-30, 15:15   #1128
Madpoo
Serpentine Vermin Jar
 
Madpoo's Avatar
 
Jul 2014

1100110011012 Posts
Default

Quote:
Originally Posted by Mark Rose View Post
Indeed. Binary and UTF-8 are the only character sets that should ever be used in a modern system.
With SQL collations it's more about the sorting, or even how certain umlauts are handled (it surprised me to see "UE" treated as equal to "Ü".

And when sorting, should "Éclair" come before/after "eclair", or, with binary collation, will it show up after "zebra"?

Curiously, even for the same language, different locales may opt to sort accented characters differently. I'm trying to remember the example... I don't know if it was a difference in fr-FR and fr-CA, or maybe pt-PT and pt-BR. Whatever the case... languages are funny things.

Those account for the accent-sensitive and case-sensitive options in collations, but then with other charsets like Cyrillic and Polish (not to mention Chinese, Japanese and Korean...CJK) you have to pay even more attention, and when comparing across the two, find some common collation (like binary, perhaps) where both character sets have a place to live.

Before my current job I never thought a single moment about any of this... SQL collations, ASCII folding in search indices, German decompounding when indexing/searching, the double-wide western characters in CJK (or the frustrating search for a decent font that can show all of the common Unicode characters, monospaced. One that has Japanese *and* Korean, and won't show the Korean characters sideways as many of the freebies would do).

We haven't expanded to Turkey, Greece, any of the Arab countries or Israel. I imagine the fun we'll have if/when we do our first right-to-left (RTL) language and how that would impact our entire design... LOL

Anyway, I don't blame anyone for making the DB columns varchar like they are now... until it's a problem, you don't really know how interesting that makes things.
Madpoo is offline   Reply With Quote
Old 2016-08-30, 15:31   #1129
James Heinrich
 
James Heinrich's Avatar
 
"James Heinrich"
May 2004
ex-Northern Ontario

13·239 Posts
Default

Quote:
Originally Posted by Madpoo View Post
... the double-wide western characters in CJK
I ran into that, where one client has an office in China and the users there would type things into my forms like "This is not ascii" which looks like normal letters, but it's all high-range unicode chars (FF01-FFE6) and of course doesn't match any text you're may be looking for (without doing some character set juggling as I ended up doing).
James Heinrich is offline   Reply With Quote
Old 2016-08-30, 15:38   #1130
science_man_88
 
science_man_88's Avatar
 
"Forget I exist"
Jul 2009
Dumbassville

836910 Posts
Default

Quote:
Originally Posted by Madpoo View Post
With SQL collations it's more about the sorting, or even how certain umlauts are handled (it surprised me to see "UE" treated as equal to "Ü".

And when sorting, should "Éclair" come before/after "eclair", or, with binary collation, will it show up after "zebra"?

Curiously, even for the same language, different locales may opt to sort accented characters differently. I'm trying to remember the example... I don't know if it was a difference in fr-FR and fr-CA, or maybe pt-PT and pt-BR. Whatever the case... languages are funny things.

Those account for the accent-sensitive and case-sensitive options in collations, but then with other charsets like Cyrillic and Polish (not to mention Chinese, Japanese and Korean...CJK) you have to pay even more attention, and when comparing across the two, find some common collation (like binary, perhaps) where both character sets have a place to live.

Before my current job I never thought a single moment about any of this... SQL collations, ASCII folding in search indices, German decompounding when indexing/searching, the double-wide western characters in CJK (or the frustrating search for a decent font that can show all of the common Unicode characters, monospaced. One that has Japanese *and* Korean, and won't show the Korean characters sideways as many of the freebies would do).

We haven't expanded to Turkey, Greece, any of the Arab countries or Israel. I imagine the fun we'll have if/when we do our first right-to-left (RTL) language and how that would impact our entire design... LOL

Anyway, I don't blame anyone for making the DB columns varchar like they are now... until it's a problem, you don't really know how interesting that makes things.
so basically you've been running into this a lot:

edit: realized my error tried to use a URL in a youtube tag.

Last fiddled with by science_man_88 on 2016-08-30 at 15:57
science_man_88 is offline   Reply With Quote
Old 2016-08-30, 17:39   #1131
Mark Rose
 
Mark Rose's Avatar
 
"/X\(‘-‘)/X\"
Jan 2013

32·11·29 Posts
Default

Quote:
Originally Posted by Madpoo View Post
With SQL collations it's more about the sorting, or even how certain umlauts are handled (it surprised me to see "UE" treated as equal to "Ü".

And when sorting, should "Éclair" come before/after "eclair", or, with binary collation, will it show up after "zebra"?

Curiously, even for the same language, different locales may opt to sort accented characters differently. I'm trying to remember the example... I don't know if it was a difference in fr-FR and fr-CA, or maybe pt-PT and pt-BR. Whatever the case... languages are funny things.

Those account for the accent-sensitive and case-sensitive options in collations, but then with other charsets like Cyrillic and Polish (not to mention Chinese, Japanese and Korean...CJK) you have to pay even more attention, and when comparing across the two, find some common collation (like binary, perhaps) where both character sets have a place to live.

Before my current job I never thought a single moment about any of this... SQL collations, ASCII folding in search indices, German decompounding when indexing/searching, the double-wide western characters in CJK (or the frustrating search for a decent font that can show all of the common Unicode characters, monospaced. One that has Japanese *and* Korean, and won't show the Korean characters sideways as many of the freebies would do).

We haven't expanded to Turkey, Greece, any of the Arab countries or Israel. I imagine the fun we'll have if/when we do our first right-to-left (RTL) language and how that would impact our entire design... LOL

Anyway, I don't blame anyone for making the DB columns varchar like they are now... until it's a problem, you don't really know how interesting that makes things.
Yikes. I'm glad don't have to deal with all that.

But I have seen collation errors at the library. In French, é is sorted with e, as it's an accented letter. But in Swedish, for instance, å is the 27th letter of the alphabet (which ends in xyzåäö) and not an accented letter. The poor librarians collated å, ä, and ö, with a and o. Interesting, Swedish does have accented letters, mainly in French and German loan words. So you'll see é and ü occasionally, and some others, but they're not separate letters. Fun times.

French has messed up collation rules. Sorting of accented characters is done right-to-left.
Mark Rose is offline   Reply With Quote
Old 2016-08-31, 05:13   #1132
Madpoo
Serpentine Vermin Jar
 
Madpoo's Avatar
 
Jul 2014

29·113 Posts
Default

Quote:
Originally Posted by science_man_88 View Post
so basically you've been running into this a lot:
Brilliant rant!

And yes... all of that. I loved the bit about adding an ellipsis to break up some text when you want to just show a snippet. We had to deal with that a little bit ago.

When dealing with words/sentences of unusual lengths, on one end of the spectrum we have Simplified Chinese (zh) where an entire sentence might only be 5-6 glyphs.

On the other end is German and Russian... mostly German. Their unusual (to non-Germans) way of compounding words into obnoxiously long things is amusing at first, and frustrating when you're trying to come up with text for menus or headers.

There's the funny example of Donaudampfschiffahrtselektrizitätenhauptbetriebswerkbauunterbeamtengesellschaft

And you can't just blanket upper/lowercase words for style reasons, because Germans won't like that.

Fortunately we have a staff of very dedicated and hard working editors/translators who work patiently with our devs to work around all of those special cases.

I'll have to share this video with the team... they'd get a kick out of it.

Anyway... to keep it on topic, so yeah, I guess I should find a way and time to alter some columns to nvarchar just to make sure we're storing display names (at the very least) in a suitable way.

Last fiddled with by Madpoo on 2016-08-31 at 05:14
Madpoo is offline   Reply With Quote
Old 2016-08-31, 14:23   #1133
James Heinrich
 
James Heinrich's Avatar
 
"James Heinrich"
May 2004
ex-Northern Ontario

C2316 Posts
Default

Quote:
Originally Posted by Madpoo View Post
... mostly German. Their unusual (to non-Germans) way of compounding words into obnoxiously long things is amusing at first

(this version has English translation but worse pictures/voiceover).

Last fiddled with by James Heinrich on 2016-08-31 at 14:47
James Heinrich is offline   Reply With Quote
Reply

Thread Tools


Similar Threads
Thread Thread Starter Forum Replies Last Post
Official "Faits erronés dans de belles-lettres" thread ewmayer Lounge 39 2015-05-19 01:08
Official "all-Greek-to-me Fiction Literature and Cinema" Thread ewmayer Science & Technology 41 2014-04-16 11:54
Official "Lasciate ogne speranza" whinge-thread cheesehead Soap Box 56 2013-06-29 01:42
Official "Ernst is a deceiving bully and George is a meanie" thread cheesehead Soap Box 61 2013-06-11 04:30
Official "String copy Statement Considered Harmful" thread Dubslow Programming 19 2012-05-31 17:49

All times are UTC. The time now is 10:17.

Wed Oct 21 10:17:49 UTC 2020 up 41 days, 7:28, 0 users, load averages: 1.88, 2.04, 1.85

Powered by vBulletin® Version 3.8.11
Copyright ©2000 - 2020, Jelsoft Enterprises Ltd.

This forum has received and complied with 0 (zero) government requests for information.

Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation.
A copy of the license is included in the FAQ.