Character Encodings for Beginners

Character encoding governs how characters travel from databases to web browsers and back again. The way characters are encoded has a long and winding history.

Byte & Character

A byte has 8 bits, which can represent 256 different values. That’s more than enough for all the English letters plus numerical digits, common punctuation marks, spaces, tabs, and other control characters. ASCII, a 7-bit character encoding that defines 128 characters, has been a standard for electronic communication since the 1960s.
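As a quick illustration, the arithmetic behind those numbers can be sketched in Python. The variable names are made up for this example only.

```python
# A byte is 8 bits, so it can hold 2**8 distinct values.
values_per_byte = 2 ** 8
print(values_per_byte)        # 256

# ASCII defines codes 0 through 127, so every ASCII character
# fits comfortably in a single byte.
encoded = "A".encode("ascii")
print(len(encoded))           # 1
print(encoded[0])             # 65, the ASCII code for "A"
```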


However, many non-English languages have far more than 256 characters, such as Korean (over 11,000 characters) and Chinese (40,000+), so it is impossible to squeeze all of them into a single byte. To hold these characters, a character must be represented by one or more bytes.
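The limit is easy to see in practice. The sketch below, assuming Python 3, tries to store one Korean character as ASCII and then as UTF-8 (a multi-byte encoding described further down). The variable names are illustrative.

```python
text = "\uD55C"  # the single Hangul character 한

try:
    text.encode("ascii")
    fits_in_ascii = True
except UnicodeEncodeError:
    # ASCII has no code for this character.
    fits_in_ascii = False

# A multi-byte encoding spends several bytes on the same character.
utf8_bytes = text.encode("utf-8")

print(fits_in_ascii)     # False
print(len(text))         # 1 character
print(len(utf8_bytes))   # 3 bytes
```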

Unicode – A New Character Set Standard

In the late 1980s, a new standard called Unicode was developed: a modern character set with room for over a million code points, more than enough to account for every character in any language. Unicode is the universal standard for representing text in all human languages. It even includes emojis.

Unicode is a character standard, not a character encoding. Computers need a way to translate Unicode code points into binary in order to store them in text files. Here’s where UTF-8 comes in.
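To make the distinction concrete, here is a small Python sketch that looks only at code points (the numbers Unicode assigns), with no bytes involved yet. The helper name code_point_label is hypothetical.

```python
import sys

def code_point_label(ch: str) -> str:
    """Return the Unicode code point of a single character, e.g. 'U+0041'."""
    return f"U+{ord(ch):04X}"

# "\uD55C" is the Hangul syllable 한, "\U0001F604" is the emoji 😄.
labels = [code_point_label(ch) for ch in ("A", "\uD55C", "\U0001F604")]
print(labels)              # ['U+0041', 'U+D55C', 'U+1F604']

# Code points run from 0 to sys.maxunicode, i.e. 1,114,112 values in total.
total_code_points = sys.maxunicode + 1
print(total_code_points)   # 1114112
```

How those numbers become bytes is the job of an encoding, not of Unicode itself.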

UTF-8 Encoding

UTF-8 is an encoding system for Unicode. It uses a variable-width method in which a character takes between 1 and 4 bytes, which saves storage while still accommodating every Unicode character. In UTF-8, some characters like X take only 1 byte, and some characters like the emoji 😄 take as many as 4.
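The variable width can be observed directly. This is a minimal Python sketch; the sample characters other than X and the emoji are chosen here purely for illustration.

```python
# One sample character for each possible UTF-8 width.
samples = {
    "X": "X",                    # basic Latin letter
    "e-acute": "\u00E9",         # é
    "hangul": "\uD55C",          # 한
    "emoji": "\U0001F604",       # 😄
}

utf8_lengths = {name: len(ch.encode("utf-8")) for name, ch in samples.items()}
print(utf8_lengths)
# {'X': 1, 'e-acute': 2, 'hangul': 3, 'emoji': 4}
```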

Best of all, UTF-8 is backward compatible with ASCII: any valid ASCII text is also valid UTF-8, byte for byte.
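A short Python check of that compatibility follows. The sample string is arbitrary.

```python
text = "Hello, world!"   # plain ASCII text

ascii_bytes = text.encode("ascii")
utf8_bytes = text.encode("utf-8")

# The two encodings produce identical bytes for ASCII characters...
same_bytes = ascii_bytes == utf8_bytes
print(same_bytes)                    # True

# ...so bytes written as ASCII can be read back as UTF-8 unchanged.
decoded = ascii_bytes.decode("utf-8")
print(decoded == text)               # True
```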

There are three Unicode character encodings: UTF-8, UTF-16, and UTF-32. Of these three, only UTF-8 should be used for web content.
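For a rough sense of how the three differ in size, the Python sketch below encodes the same character three ways. It uses the little-endian codec names ("utf-16-le", "utf-32-le") so that no byte order mark is added to the count; the helper encoded_sizes is a made-up name.

```python
ENCODINGS = ("utf-8", "utf-16-le", "utf-32-le")

def encoded_sizes(ch: str) -> dict:
    """Return the number of bytes each encoding needs for one character."""
    return {enc: len(ch.encode(enc)) for enc in ENCODINGS}

print(encoded_sizes("X"))
# {'utf-8': 1, 'utf-16-le': 2, 'utf-32-le': 4}

print(encoded_sizes("\U0001F604"))   # the emoji 😄
# {'utf-8': 4, 'utf-16-le': 4, 'utf-32-le': 4}
```

For mostly-ASCII content such as HTML markup, UTF-8 is the most compact of the three.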

An HTML page can only be in one encoding, so a Unicode-based encoding such as UTF-8 is a good fit: it can support many languages and accommodate pages and forms in any mixture of those languages (source: w3.org).

What Should You Do As a Web Developer?

At the time of this writing, UTF-8 is used by 96.4% of all websites whose character encoding is known.

A web page should include the following in its <head> section:

<meta http-equiv="Content-type" content="text/html; charset=UTF-8">

This element, or its shorter HTML5 equivalent <meta charset="utf-8">, tells the browser to use the UTF-8 character encoding when translating the page’s bytes into human-readable text, and vice versa when sending text back.
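To see why the declaration matters, here is a Python sketch of what happens when UTF-8 bytes are read with the wrong encoding. Latin-1 stands in for a legacy fallback encoding; that choice is an assumption for illustration, not what any particular browser does.

```python
original = "caf\u00E9"                 # "café"
raw_bytes = original.encode("utf-8")   # what the server actually sends

# Decoded with the declared (correct) encoding, the text survives.
right = raw_bytes.decode("utf-8")
print(right == original)               # True

# Decoded with a wrong guess, the two bytes of "é" become two characters.
wrong = raw_bytes.decode("latin-1")
print(wrong)                           # cafÃ©
```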

The good news is that the HTML5 specification already requires documents to be encoded as UTF-8. Include the following

<!DOCTYPE html>

at the top of your HTML file to declare that it’s an HTML5 file. Note that browsers do not reliably infer the encoding from the doctype alone, so keep the aforementioned meta element (or send a charset in the HTTP Content-Type header) to make sure the page is read as UTF-8.
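To double-check what a page actually declares in its markup, a small script can help. The following is a minimal sketch using Python's standard html.parser module. The names CharsetFinder and declared_charset are hypothetical, and the sketch ignores HTTP headers and byte order marks.

```python
from html.parser import HTMLParser

class CharsetFinder(HTMLParser):
    """Record the first charset declared by a <meta> element."""

    def __init__(self):
        super().__init__()
        self.charset = None

    def handle_starttag(self, tag, attrs):
        if tag != "meta" or self.charset is not None:
            return
        attr = dict(attrs)
        if attr.get("charset"):
            # Short form: <meta charset="utf-8">
            self.charset = attr["charset"].strip().lower()
        elif (attr.get("http-equiv") or "").lower() == "content-type":
            # Long form: content="text/html; charset=UTF-8"
            content = (attr.get("content") or "").lower()
            if "charset=" in content:
                value = content.split("charset=", 1)[1]
                self.charset = value.split(";")[0].strip()

def declared_charset(html: str):
    """Return the charset declared in the markup, or None if there is none."""
    finder = CharsetFinder()
    finder.feed(html)
    return finder.charset

print(declared_charset('<head><meta charset="utf-8"></head>'))   # utf-8
print(declared_charset("<!DOCTYPE html><html></html>"))          # None
```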

Happy coding!

Recommended reading:
https://www.w3.org/International/questions/qa-choosing-encodings