MySQL Character Sets & Collation

character encoding questions mark

In this post, we will focus on a specific type of character set and encoding, MySQL CHARSET & COLLATION.  I also recommend checking out the earlier post on character encoding for beginners.

Have you ever wondered why some non-English characters with accents or Chinese characters such as “豆贝尔维 ” showing as question marks ���� on a web page?

TL:DR;

  1. Best practice is to go completely utf8mb4.
    SET NAMES utf8mb4
  2. Set column collation to
    utf8mb4_0900_ai_ci
  3. On your web page include
    <meta http-equiv=”Content-type” content=”text/html; charset=UTF-8″>

MySQL Character Sets & Collation

MySQL supports various character sets to store almost every character in a string. You can find out the current character set supported by using the following MySQL command:

mysql> SHOW CHARACTER SET;

‘Collation’ refers to how those characters could be compared for inequality in WHERE and ORDER BY, GROUP BY, FOREIGN KEY CONSTRAINTS, and strings case (in)sensitive comparison.

Even though MySQL has been around since 1996, it didn’t start supporting utf8 encoding until 2004 in version 4. Before that, only English characters can be stored. In this version, instead of utf8, the default charset / collation was set to “latin1 / latin1_swedish_ci“.

MySQL’s UTF8 problem

Although the name suggests UTF-8, MySQL’s utf8 encoding is a proprietary character encoding that only supported three bytes per character. The actual UTF-8 encoding needs up to four bytes per character, as discussed in the “Character set for beginners” post. As a result, the encoding can’t encode many Unicode characters.

In 2010 (six years later!), MySQL finally fixed this embarrassing oversight of the utf8 encoding bug in version 5.5 with a new character set called “utf8mb4″. It’s never publicly announced, probably too ashamed to admit such humiliating mistakes. Consequently, few people knew about the significance of the change and remained in the dark with the old (and wrong) encoding. 

If you are running MySQL before 5.5, you got only ‘utf8‘. You are out of luck if you need to store and display Emoji 👌 or Chinese 汉字. You will need ‘utf8mb4‘, which is only available in MySQL 5.7+. 

Fortunately, since MySQL 8 (released in 2018), utf8mb4 becomes the default character set with collation utf8mb4_0900_ai_ci. We can finally put that MySQL Unicode encoding madness behind us.

What about MariaDB?

MariaDB suffered from the same fate as MySQL in UTF8 encoding. Fortunately, it has added utf8mb4 support since version 5.5 in 2012.

Other Character Set Problems

Besides MySQL historical utf8 bug, there are additional three factors to character set problems (apply to all databases):

  1. Client’s bytes encoding;
  2. What SET NAMES you use (or what the default is);
  3. The CHARACTER SET on the column definition.

All is well if the SET NAMES agrees with the encoding of the bytes in the Clients. 

CHARACTER SET in column definition needs not to agree with SET NAMES;  if they differ, a conversion will be performed for you. If characters exist in both encodings, the conversion will be transparent. 

The problems occur when SET NAMES disagree with the Client’s bytes. 

If a source character does not exist in the target encoding (example: when converting a Chinese character from utf8mb4 to latin1), a “�” is usually put in its place since Chinese has way more characters than what is in Latin. 

MySQL/MariaDB Character Encoding Best Practice

To wrap up, you should upgrade to MySQL 8, and go completely utf8mb4

SET NAMES utf8mb4

In addition, for web content, a web page (if that is what you have) should always include following in <head> section:

<meta http-equiv="Content-type" content="text/html; charset=UTF-8">

Recommended readings: