Encoding gibberish

An article from Loria Wiki.

What you see when character encodings go awry (note: this page is UTF-8 encoded). The point here isn't to list every possible permutation, only the ones that we actually encounter in real life.

Symptom: l\351g\350re
Original: légère
Cause: ISO 8859-1 text being marked as ASCII
Solution: convert the text from ISO 8859-1 to UTF-8 (if you are working in UTF-8); get the other party to fix their software

Symptom: le'ge`re
Original: légère
Cause: Netscape mail?

Symptom: l?g?re
Original: légère
Cause: ISO 8859-1 text being read as UTF-8
Solution: convert the text to UTF-8

Symptom: lÃ©gÃ¨re
Original: légère
Cause: UTF-8 Latin text being read as ISO 8859-1
Solution: switch your software to UTF-8

Symptom: lÃ©gÃ¨re (even with your software set to UTF-8)
Original: légère
Cause: UTF-8 Latin text mistakenly converted again from Latin-1 to UTF-8, so that it is encoded TWICE; crazy stuff!
Solution: convert the text from UTF-8 to ISO 8859-1 (yes, I know it's weird); you should get a proper UTF-8 file back

Symptom: l√©g√®re
Original: légère
Cause: UTF-8 Latin text being read as MacRoman
Solution: switch your software to UTF-8

Symptom: lÈgËre
Original: légère
Cause: ISO 8859-1 text being read as MacRoman
Solution: convert the text to UTF-8; switch your software to UTF-8

Symptom: ë¸ãêèé
Original: лёгкий
Cause: cp1251 (Cyrillic encoding) text being interpreted as ISO 8859-1
Solution: if the file still holds the raw cp1251 bytes, run
iconv -f cp1251 -t utf-8 <file>
or, if the gibberish has already been saved as UTF-8, run
iconv -f utf-8 -t iso8859-1 <file> | iconv -f cp1251 -t utf-8

Symptom: МЈЗЛЙК
Original: лёгкий
Cause: koi8-r (Cyrillic encoding) text being interpreted as cp1251 (another Cyrillic encoding)
Solution: copy the text to a file in UTF-8 encoding and run
iconv -f utf-8 -t cp1251 <file> | iconv -f koi8-r -t utf-8
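Most of the symptoms listed above come from the same mechanism: bytes written in one encoding get decoded with another. Here is a minimal Python sketch of that mechanism, assuming Python 3 and its standard codecs (`latin-1`, `mac_roman`, `cp1251`, `koi8_r`). The helper names `misread` and `repair` are my own, made up for illustration; they are not part of any library.

```python
def misread(text, actual, assumed):
    """Show what `text`, stored in encoding `actual`, looks like
    when software decodes the bytes as `assumed` (hypothetical helper)."""
    return text.encode(actual).decode(assumed, errors="replace")


def repair(garbled, assumed, actual):
    """Undo misread(): turn the gibberish back into the original bytes
    using the wrongly assumed encoding, then decode them with the real one."""
    return garbled.encode(assumed).decode(actual)


# Reproduce a few symptoms.
print(misread("légère", "utf-8", "latin-1"))      # lÃ©gÃ¨re
print(misread("légère", "utf-8", "mac_roman"))    # l√©g√®re
print(misread("légère", "latin-1", "mac_roman"))  # lÈgËre
print(misread("лёгкий", "cp1251", "latin-1"))     # ë¸ãêèé
print(misread("лёгкий", "koi8_r", "cp1251"))      # МЈЗЛЙК

# ISO 8859-1 bytes are not valid UTF-8, so each accented letter is
# replaced by U+FFFD (often displayed as "?"); the information is lost.
print(misread("légère", "latin-1", "utf-8"))

# Repairs mirror the iconv pipelines: go back to the original bytes,
# then decode them with the encoding they were really written in.
print(repair("МЈЗЛЙК", "cp1251", "koi8_r"))       # лёгкий
print(repair("lÃ©gÃ¨re", "latin-1", "utf-8"))     # légère
```

The last call is also the "weird" fix for doubly encoded text: encoding the gibberish as ISO 8859-1 strips off the extra layer and leaves proper UTF-8 behind. Note that `repair` only works while no information has been lost; the `l?g?re` case cannot be undone this way and needs the original file.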

Note that I tend to prefer switching things to UTF-8 whenever possible. This encoding is ASCII friendly and it can represent any Unicode text. So, for example, if you want to mix Arabic with Chinese, or Russian with French, UTF-8 is the way to go.
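To make the two claims above concrete, here is a small Python sketch, assuming Python 3. The helper `fits` and the sample strings are my own illustrations; the Chinese and Arabic words are just arbitrary sample text.

```python
def fits(text, encoding):
    """Return True if every character of `text` can be stored in `encoding`
    (hypothetical helper for this example)."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


ascii_only = "plain ASCII text"
mixed = "légère лёгкий 轻 خفيف"  # French, Russian, Chinese and Arabic samples

# ASCII friendly: pure ASCII text has exactly the same bytes in UTF-8.
print(ascii_only.encode("utf-8") == ascii_only.encode("ascii"))  # True

# Only UTF-8 can hold the mixed text; each legacy encoding covers one script.
for encoding in ("latin-1", "cp1251", "koi8_r", "utf-8"):
    print(encoding, fits(mixed, encoding))
```

Of the four encodings tried, only `utf-8` reports `True` for the mixed string, which is why it is the practical choice once several scripts share one file.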

