mersenneforum.org  

Go Back   mersenneforum.org > Great Internet Mersenne Prime Search > PrimeNet

Reply
 
Thread Tools
Old 2016-08-21, 05:21   #1101
retina
Undefined
 
retina's Avatar
 
"The unspeakable one"
Jun 2006
My evil lair

10110000000102 Posts
Default UTF-8 encoding conversion failure?

With reference to the following assignment:
http://www.mersenne.org/report_expon...6917443&full=1

It shows as assigned to user: "João" and the webpage encoding is defined as UTF-8, but clearly the characters are wrong.

More likely the user name is João which would encode the ã as the two UTF-8 bytes ã.

So presumably the DB has recorded the user name in UTF-8 form (so far so good), but upon retrieval it gets pulled out as bytes and re-encoded to UTF-8? Or maybe it has recorded each UTF-8 byte as a separate UTF-8 character? Anyhow, whatever it is, is ain't right IMO.

Last fiddled with by retina on 2016-08-21 at 05:23
retina is online now   Reply With Quote
Old 2016-08-21, 08:02   #1102
GP2
 
GP2's Avatar
 
Sep 2003

2,579 Posts
Default

Quote:
Originally Posted by retina View Post
It shows as assigned to user: "João" and the webpage encoding is defined as UTF-8, but clearly the characters are wrong.
There are quite a few of these distorted names, and most of them can be identified unambiguously:

cp1250: £o¿yñski → Łożyński
cp1251: þðèé → Юрий
HTML: Vidaković → Vidaković
UTF-8: Løkken → Løkken
ISO 8859-2: ©eliga → Šeliga
ISO 8859-9: Buðday → Buğday
KSC: ±è½Ã¸ó → 김시몬

GIMPS started 20 years ago, and many of these names date back to an era when Unicode was not common and a lot of 8-bit character sets were in use.

Most of the time it's a straightforward mapping, but in a few cases the eight-bit character sets map to 8-bit control codes (\x80 to \x9F), and then the character is replaced by a question mark and you have to use language awareness:

cp1250: Tomá? ?ikorský → Tomáš Šikorský

Google can help nail these cases down with confidence.

Sometimes these raw 8-bit control codes are actually in the database as such:

UTF-8: KeÃ<9f>ler → Keßler
cp850: J<81>rgen → Jürgen

(e.g., here the fourth byte is a literal hex 9F, represented above as <9f>)

UTF-8: <e6><b1><aa><e6><98><be><e6><9e><97> → 汪显林
SJIS: <89><c1><93><a1> <96><46><8f><ba> → 加藤 芳昭

We could fix most of these systematically, although a few cases seem hopelessly undecipherable. The question is: can the database handle full-blown Unicode names in Cyrillic, Korean, Japanese, Chinese, eastern European, etc.

Right now, though, the database already contains some unprintable 8-bit control characters in the \x80 to \x9F range and that isn't a good thing.

Last fiddled with by GP2 on 2016-08-21 at 08:22
GP2 is offline   Reply With Quote
Old 2016-08-21, 08:24   #1103
retina
Undefined
 
retina's Avatar
 
"The unspeakable one"
Jun 2006
My evil lair

2·32·313 Posts
Default

Quote:
Originally Posted by GP2 View Post
The question is: can the database handle full-blown Unicode names in Cyrillic, Korean, Japanese, Chinese, eastern European, etc.
I think UTF-8 is a better option than Unicode.
Quote:
Originally Posted by GP2 View Post
Right now, though, the database already contains some unprintable 8-bit control characters in the \x80 to \x9F range and that isn't a good thing.
Those are in the UTF-8 encoding space, so should not be a problem.
retina is online now   Reply With Quote
Old 2016-08-21, 08:45   #1104
GP2
 
GP2's Avatar
 
Sep 2003

2,579 Posts
Default

Quote:
Originally Posted by retina View Post
I think UTF-8 is a better option than Unicode.
Same thing. UTF-8 is just a very common way of encoding Unicode.

I think PrimeNet uses SQL Server. Maybe Madpoo could give us a bit of insight into the data type of the relevant database field?

I could supply him with a set of corrected names, assuming the database is able to store them.
GP2 is offline   Reply With Quote
Old 2016-08-21, 09:47   #1105
retina
Undefined
 
retina's Avatar
 
"The unspeakable one"
Jun 2006
My evil lair

2×32×313 Posts
Default

Quote:
Originally Posted by GP2 View Post
Same thing. UTF-8 is just a very common way of encoding Unicode.
Okay, but I meant that the server could just be told that it is all bytes and it doesn't need to be clever with handling any code conversions. The only thing affected in that mode is that sorting and lexicographic ordering might not appear logical.
retina is online now   Reply With Quote
Old 2016-08-21, 16:30   #1106
GP2
 
GP2's Avatar
 
Sep 2003

2,579 Posts
Default

Here's an old thread from 2014 which says that the SQL Server database field is varchar rather than nvarchar, so it can't store Unicode. Unfortunate.

It would still be possible to fix a subset of the names, including "João" to "João".

Or all the accented names could be converted to use HTML &-style entity numbers, like webpages do.

Edit: hmm, that wouldn't work, see M50685343, where the name displays literally as "Vladan Vidakovi&#263;" rather than "Vladan Vidaković"

Last fiddled with by GP2 on 2016-08-21 at 16:35
GP2 is offline   Reply With Quote
Old 2016-08-21, 17:07   #1107
Madpoo
Serpentine Vermin Jar
 
Madpoo's Avatar
 
Jul 2014

2·1,637 Posts
Default

Quote:
Originally Posted by GP2 View Post
Here's an old thread from 2014 which says that the SQL Server database field is varchar rather than nvarchar, so it can't store Unicode. Unfortunate.

It would still be possible to fix a subset of the names, including "João" to "João".

Or all the accented names could be converted to use HTML &-style entity numbers, like webpages do.

Edit: hmm, that wouldn't work, see M50685343, where the name displays literally as "Vladan Vidaković" rather than "Vladan Vidaković"
There are indeed some oddities when characters fall outside the extended ASCII range. The SQL table for usernames is unfortunately varchar instead of nvarchar.

A fix for that would be to alter the table and convert to nvarchar... I imagine for the most part it would be an easy enough transition since SQL can typically do implicit casts between them and won't complain, but there would be issues in boolean operations where varchar <> nvarchar so we'd have to dig into it more to make sure we're good.
Madpoo is offline   Reply With Quote
Old 2016-08-21, 17:21   #1108
science_man_88
 
science_man_88's Avatar
 
"Forget I exist"
Jul 2009
Dumbassville

26×131 Posts
Default

Quote:
Originally Posted by GP2 View Post
Edit: hmm, that wouldn't work, see M50685343, where the name displays literally as "Vladan Vidaković" rather than "Vladan Vidaković"
my first thought is a way to stop cross site scripting using forms because technically( in theory) a script tag can use a form that interprets these characters ( for example the less than or greater than sign) in theory if processed into what they are meant for to cross site script or introduce code if not read like a normal string so for example if I wrote a code like < script ></script > in a name field in theory if it got changed without at least being checked it could activate any code inside. and be read as part of the page by the browser so I could in theory make a self retweeting tweet scenario. if you saw this post before posting you would see HTML equivalents of the less than and greater than signs. if these were to get parsed in theory I could make a code work to break it .

Last fiddled with by science_man_88 on 2016-08-21 at 18:04
science_man_88 is offline   Reply With Quote
Old 2016-08-21, 20:00   #1109
GP2
 
GP2's Avatar
 
Sep 2003

2,579 Posts
Default

Quote:
Originally Posted by Madpoo View Post
I imagine for the most part it would be an easy enough transition since SQL can typically do implicit casts between them and won't complain, but there would be issues in boolean operations where varchar <> nvarchar so we'd have to dig into it more to make sure we're good.
How normalized is the database? Does the Public Name (from http://www.mersenne.org/update/ ) really get stored in more than one place, or get compared against, or used for purposes other than display? I would have thought that the V5UserID from prime.txt or the SrvrUID from local.txt would be the key fields in the tables.

Last fiddled with by GP2 on 2016-08-21 at 20:02
GP2 is offline   Reply With Quote
Old 2016-08-23, 16:04   #1110
Madpoo
Serpentine Vermin Jar
 
Madpoo's Avatar
 
Jul 2014

2·1,637 Posts
Default

Quote:
Originally Posted by GP2 View Post
How normalized is the database? Does the Public Name (from http://www.mersenne.org/update/ ) really get stored in more than one place, or get compared against, or used for purposes other than display? I would have thought that the V5UserID from prime.txt or the SrvrUID from local.txt would be the key fields in the tables.
The user's public name is only referenced in one table, which makes sense of course.

I thought about it some today and my best guess on how to improve it would be to add an additional nvarchar column and slowly port any DB calls/sprocs/etc to use that instead, deprecating and eventually removing the old column.

Kinda messes things up but if this were my day job and I wanted to ensure reliable operations, that's how I'd do it... gradually and surely with some bleedover time.

But even that means making sure any updates/inserts aren't using implicit column placements... I'm pretty sure anything like that is explicitly handled (as far as I've seen, that's the case) but Murphy's law says there'd be that one exception where an insert assumes columns are in a certain order and doesn't spell 'em out.
Madpoo is offline   Reply With Quote
Old 2016-08-23, 16:41   #1111
chalsall
If I May
 
chalsall's Avatar
 
"Chris Halsall"
Sep 2002
Barbados

2·13·353 Posts
Default

Quote:
Originally Posted by Madpoo View Post
But even that means making sure any updates/inserts aren't using implicit column placements... I'm pretty sure anything like that is explicitly handled (as far as I've seen, that's the case) but Murphy's law says there'd be that one exception where an insert assumes columns are in a certain order and doesn't spell 'em out.
Implicit column placements... Shutter....

Your planned migration sounds sane. Add the new column at the end of the table and the worst case will be the new nvarchar column isn't updated by any legacy code doing implicit references.
chalsall is offline   Reply With Quote
Reply

Thread Tools


Similar Threads
Thread Thread Starter Forum Replies Last Post
Official "Faits erronés dans de belles-lettres" thread ewmayer Lounge 39 2015-05-19 01:08
Official "all-Greek-to-me Fiction Literature and Cinema" Thread ewmayer Science & Technology 41 2014-04-16 11:54
Official "Lasciate ogne speranza" whinge-thread cheesehead Soap Box 56 2013-06-29 01:42
Official "Ernst is a deceiving bully and George is a meanie" thread cheesehead Soap Box 61 2013-06-11 04:30
Official "String copy Statement Considered Harmful" thread Dubslow Programming 19 2012-05-31 17:49

All times are UTC. The time now is 11:57.

Fri Aug 14 11:57:56 UTC 2020 up 1 day, 8:33, 0 users, load averages: 1.74, 1.60, 1.88

Powered by vBulletin® Version 3.8.11
Copyright ©2000 - 2020, Jelsoft Enterprises Ltd.

This forum has received and complied with 0 (zero) government requests for information.

Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation.
A copy of the license is included in the FAQ.