2016-08-21, 08:02   #1102
GP2
 
 

Quote:
Originally Posted by retina View Post
It shows as assigned to user: "JoÃ£o" and the webpage encoding is defined as UTF-8, but clearly the characters are wrong.
There are quite a few of these distorted names, and most of them can be identified unambiguously:

cp1250: £o¿yñski → Łożyński
cp1251: þðèé → Юрий
HTML: Vidakovi&#263; → Vidaković
UTF-8: LÃ¸kken → Løkken
ISO 8859-2: ©eliga → Šeliga
ISO 8859-9: Buðday → Buğday
KSC: ±è½Ã¸ó → 김시몬
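
For anyone who wants to try this, here is a minimal Python sketch of the mapping. It assumes the distorted form is simply the original 8-bit bytes displayed as Latin-1, which fits the examples above but may not hold for every name in the database. The function name undo_latin1_mojibake is made up for illustration. The HTML case is a different mechanism (an escaped character reference), so it goes through html.unescape instead.

```python
import html


def undo_latin1_mojibake(garbled: str, real_codec: str) -> str:
    """Reinterpret text that was wrongly displayed as Latin-1.

    Assumption: every character in `garbled` is in the Latin-1 range,
    so encoding back to Latin-1 recovers the original bytes exactly.
    """
    raw = garbled.encode("latin-1")  # recover the original bytes
    return raw.decode(real_codec)    # decode with the intended charset


print(undo_latin1_mojibake("£o¿yñski", "cp1250"))   # Łożyński
print(undo_latin1_mojibake("©eliga", "iso8859_2"))  # Šeliga
print(undo_latin1_mojibake("Buðday", "iso8859_9"))  # Buğday

# The HTML case is an escaped character reference, not a charset mix-up.
print(html.unescape("Vidakovi&#263;"))              # Vidaković
```

The UTF-8 and KSC rows should work the same way with real_codec set to "utf-8" or "euc_kr". The guess about which codec applies still has to come from a human (or from the user's country).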

GIMPS started 20 years ago, and many of these names date back to an era when Unicode was not common and a lot of 8-bit character sets were in use.

Most of the time it's a straightforward mapping, but in a few cases the character's byte value in the original 8-bit character set falls in the C1 control-code range (\x80 to \x9F). Those characters were replaced by a question mark, and you have to use language awareness to work out what they were:

cp1250: Tomá? ?ikorský → Tomáš Šikorský
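
The guessing is less open-ended than it looks, because only a few letters of a given code page live in that byte range. A rough sketch that lists the candidates for a "?" (the helper name c1_letters is hypothetical, and it only looks at single-byte code pages):

```python
def c1_letters(codec: str) -> list[str]:
    """Letters that `codec` stores at byte values 0x80 to 0x9F.

    These are the characters that can end up as '?' when the bytes are
    treated as Latin-1 control codes.
    """
    letters = []
    for byte in range(0x80, 0xA0):
        try:
            ch = bytes([byte]).decode(codec)
        except UnicodeDecodeError:
            continue  # byte value is unassigned in this code page
        if ch.isalpha():
            letters.append(ch)
    return letters


print("".join(c1_letters("cp1250")))
```

For cp1250 the list includes both š and Š, which is consistent with the Tomáš Šikorský reading. Picking the right letter out of the candidates is still a job for language knowledge or a search engine.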

Google can help nail these cases down with confidence.

Sometimes these raw 8-bit control codes are actually in the database as such:

UTF-8: KeÃ<9f>ler → Keßler
cp850: J<81>rgen → Jürgen

(For example, in the Keßler entry the fourth byte is a literal hex 9F, represented above as <9f>.)

UTF-8: <e6><b1><aa><e6><98><be><e6><9e><97> → 汪显林
SJIS: <89><c1><93><a1> <96><46><8f><ba> → 加藤 芳昭
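
To check entries like these, the <hh> notation can be turned back into bytes and decoded. This is only a sketch: the <hh> notation is just how I wrote the bytes out in this post, the name parse_dump is made up, and it assumes every other character stands for its Latin-1 byte.

```python
import re


def parse_dump(dump: str) -> bytes:
    """Turn 'J<81>rgen' style notation into raw bytes.

    <hh> is taken as one byte with that hex value; any other character
    is taken as its Latin-1 byte (an assumption about the dump format).
    """
    out = bytearray()
    for token in re.findall(r"<[0-9a-fA-F]{2}>|.", dump, flags=re.DOTALL):
        if len(token) == 4:            # a <hh> group
            out.append(int(token[1:3], 16))
        else:                          # an ordinary character
            out += token.encode("latin-1")
    return bytes(out)


print(parse_dump("J<81>rgen").decode("cp850"))       # Jürgen
# \xc3 below is the character shown as Ã in the Keßler entry
print(parse_dump("Ke\xc3<9f>ler").decode("utf-8"))   # Keßler
print(parse_dump("<e6><b1><aa><e6><98><be><e6><9e><97>").decode("utf-8"))  # 汪显林
```

The SJIS entry should go through the same function with .decode("shift_jis").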

We could fix most of these systematically, although a few cases seem hopelessly undecipherable. The question is: can the database handle full-blown Unicode names in Cyrillic, Korean, Japanese, Chinese, Eastern European scripts, and so on?

Right now, though, the database already contains some unprintable 8-bit control characters in the \x80 to \x9F range, and that isn't a good thing.
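
Finding the affected names is the easy part. A minimal sketch, assuming the names have already been read out of the database as Python strings (the function name has_c1_control is hypothetical):

```python
import re

# Matches any character in the C1 control range U+0080 to U+009F.
C1_CONTROL = re.compile(r"[\x80-\x9f]")


def has_c1_control(name: str) -> bool:
    """True if the name contains an unprintable C1 control character."""
    return C1_CONTROL.search(name) is not None


print(has_c1_control("J\x81rgen"))  # True
print(has_c1_control("Jürgen"))     # False
```

Properly encoded accented letters such as ü sit above U+009F, so they are not flagged.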

Last fiddled with by GP2 on 2016-08-21 at 08:22