A curated list of delightful Unicode tidbits, packages and resources.
Unicode is Awesome! Prior to Unicode, international communication was grueling: everyone had defined their own extended character set (called a Code Page) in the upper half of the 8-bit range above ASCII, and these sets conflicted. Just think of German speakers coordinating with Korean speakers over which 128-character Code Page to use. Thankfully the Unicode standard caught on and unified communication. Unicode 8.0 standardizes over 120,000 characters from 129 scripts, some modern, some ancient, and some still undeciphered. Unicode handles left-to-right and right-to-left text, combining marks, and includes diverse cultural, political, and religious characters as well as emoji. Unicode is awesomely human, and ultimately underappreciated.
- Quick Unicode Background
- Awesome Characters List
- Quirks and Troubleshooting
- Awesome Packages & Libraries
- Emojis
- Creatively Naming Variables and Methods
- Unicode Fonts
- More Reading
- Exploring Deeper into Unicode Yourself
- Overview Map
- Principles of the Unicode Standard
- Unicode Versions
What Characters Does the Unicode Standard Include?
The Unicode Standard defines codes for characters used in all the major languages written today. Scripts include the European alphabetic scripts, Middle Eastern right-to-left scripts, and many scripts of Asia.
The Unicode Standard further includes punctuation marks, diacritics, mathematical symbols, technical symbols, arrows, dingbats, emoji, etc. It provides codes for diacritics, which are modifying character marks such as the tilde (~), that are used in conjunction with base characters to represent accented letters (ñ, for example). In all, the Unicode Standard, Version 9.0 provides codes for 128,172 characters from the world’s alphabets, ideograph sets, and symbol collections.
The majority of common-use characters fit into the first 64K code points, an area of the codespace that is called the basic multilingual plane, or BMP for short. There are sixteen other supplementary planes available for encoding other characters, with currently over 850,000 unused code points. More characters are under consideration for addition to future versions of the standard.
The Unicode Standard also reserves code points for private use. Vendors or end users can assign these internally for their own characters and symbols, or use them with specialized fonts. There are 6,400 private use code points on the BMP and another 131,068 supplementary private use code points, should 6,400 be insufficient for particular applications.
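The private-use figures quoted above follow directly from the range boundaries. Here is a minimal JavaScript sketch of that arithmetic; `PRIVATE_USE_RANGES` and `isPrivateUse` are hypothetical names chosen for illustration, not part of any standard API.

```javascript
// Private-use ranges (inclusive). The last two code points of planes 15 and 16
// are noncharacters, which is why the upper bounds end in FFFD.
const PRIVATE_USE_RANGES = [
  [0xE000, 0xF8FF],     // Private Use Area on the BMP
  [0xF0000, 0xFFFFD],   // Supplementary Private Use Area-A
  [0x100000, 0x10FFFD], // Supplementary Private Use Area-B
];

// Hypothetical helper: true when codePoint falls in any private-use range.
function isPrivateUse(codePoint) {
  return PRIVATE_USE_RANGES.some(([lo, hi]) => codePoint >= lo && codePoint <= hi);
}

// Number of code points in each range.
const privateUseSizes = PRIVATE_USE_RANGES.map(([lo, hi]) => hi - lo + 1);

console.log(privateUseSizes[0]);                      // 6400
console.log(privateUseSizes[1] + privateUseSizes[2]); // 131068
console.log(isPrivateUse(0xE000));                    // true
console.log(isPrivateUse(0x1F4A9));                   // false
```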
Unicode Character Encodings
Character encoding standards define not only the identity of each character and its numeric value, or code point, but also how this value is represented in bits.
The Unicode Standard defines three encoding forms that allow the same data to be transmitted in a byte, word or double word oriented format (i.e. in 8, 16 or 32 bits per code unit). All three encoding forms encode the same common character repertoire and can be efficiently transformed into one another without loss of data. The Unicode Consortium fully endorses the use of any of these encoding forms as a conformant way of implementing the Unicode Standard.
UTF-8 is popular for HTML and similar protocols. UTF-8 is a way of transforming all Unicode characters into a variable length encoding of bytes. It has the advantages that the Unicode characters corresponding to the familiar ASCII set have the same byte values as ASCII, and that Unicode characters transformed into UTF-8 can be used with much existing software without extensive software rewrites.
UTF-16 is popular in many environments that need to balance efficient access to characters with economical use of storage. It is reasonably compact and all the heavily used characters fit into a single 16-bit code unit, while all other characters are accessible via pairs of 16-bit code units.
UTF-32 is useful where memory space is no concern, but fixed width, single code unit access to characters is desired. Each Unicode character is encoded in a single 32-bit code unit when using UTF-32.
All three encoding forms need at most 4 bytes (or 32 bits) of data for each character.
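To make the size trade-off concrete, the following sketch counts the bytes one string would occupy in each encoding form. It assumes a runtime that provides the global `TextEncoder` (modern browsers and recent Node.js) and well-formed input; `encodedSizes` is a hypothetical helper name.

```javascript
// Bytes needed to store `text` in each of the three encoding forms.
function encodedSizes(text) {
  const utf8Bytes = new TextEncoder().encode(text).length; // TextEncoder always produces UTF-8
  const utf16Bytes = text.length * 2;      // String length counts 16-bit code units
  const utf32Bytes = [...text].length * 4; // String iteration yields whole code points
  return { utf8Bytes, utf16Bytes, utf32Bytes };
}

console.log(encodedSizes('A'));         // { utf8Bytes: 1, utf16Bytes: 2, utf32Bytes: 4 }
console.log(encodedSizes('\u00F1'));    // { utf8Bytes: 2, utf16Bytes: 2, utf32Bytes: 4 }
console.log(encodedSizes('\u20AC'));    // { utf8Bytes: 3, utf16Bytes: 2, utf32Bytes: 4 }
console.log(encodedSizes('\u{1F4A9}')); // { utf8Bytes: 4, utf16Bytes: 4, utf32Bytes: 4 }
```

No single character exceeds 4 bytes in any form, which matches the statement above.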
Let's Talk Numbers
The Unicode character set is divided into 17 core segments called “planes”, which are further divided into blocks. Each plane has space for 65,536 (2¹⁶) code points, supporting a grand total of 1,114,112 code points. The last two planes (#16 and #17 in the table below) are “Private Use Area” planes, allocated to be used however one wishes. These two Private Use planes account for 131,072 code points. Note that the standard itself numbers the planes 0 through 16, so row N of the table corresponds to plane N−1 in official terminology.
# | Name | Range |
---|---|---|
1. | Basic Multilingual Plane | (U+0000 to U+FFFF) |
2. | Supplementary Multilingual Plane | (U+10000 to U+1FFFF) |
3. | Supplementary Ideographic Plane | (U+20000 to U+2FFFF) |
4. | Tertiary Ideographic Plane | (U+30000 to U+3FFFF) |
5. | Plane 5 (unassigned) | (U+40000 to U+4FFFF) |
6. | Plane 6 (unassigned) | (U+50000 to U+5FFFF) |
7. | Plane 7 (unassigned) | (U+60000 to U+6FFFF) |
8. | Plane 8 (unassigned) | (U+70000 to U+7FFFF) |
9. | Plane 9 (unassigned) | (U+80000 to U+8FFFF) |
10. | Plane 10 (unassigned) | (U+90000 to U+9FFFF) |
11. | Plane 11 (unassigned) | (U+A0000 to U+AFFFF) |
12. | Plane 12 (unassigned) | (U+B0000 to U+BFFFF) |
13. | Plane 13 (unassigned) | (U+C0000 to U+CFFFF) |
14. | Plane 14 (unassigned) | (U+D0000 to U+DFFFF) |
15. | Supplementary Special-purpose Plane | (U+E0000 to U+EFFFF) |
16. | Supplementary Private Use Area – A | (U+F0000 to U+FFFFF) |
17. | Supplementary Private Use Area – B | (U+100000 to U+10FFFF) |
The first plane is called the Basic Multilingual Plane or BMP. It contains the code points from U+0000 to U+FFFF, which are the most frequently used characters. The other sixteen planes (U+10000 to U+10FFFF) are called supplementary planes or astral planes.
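Because every plane spans exactly 0x10000 code points, the plane of a code point is simply its value shifted right by 16 bits. A small sketch follows; `planeOf` is a hypothetical helper name, and it returns the standard's zero-based plane number, so add 1 to find the matching row in the table above.

```javascript
// Returns the zero-based plane number (0 = BMP, 1..16 = supplementary planes).
function planeOf(codePoint) {
  if (!Number.isInteger(codePoint) || codePoint < 0 || codePoint > 0x10FFFF) {
    throw new RangeError('Not a Unicode code point: ' + codePoint);
  }
  return codePoint >> 16; // each plane holds 0x10000 (65,536) code points
}

console.log(planeOf(0x0041));   // 0  (LATIN CAPITAL LETTER A, on the BMP)
console.log(planeOf(0x1F4A9));  // 1  (Supplementary Multilingual Plane)
console.log(planeOf(0x10FFFF)); // 16 (Supplementary Private Use Area-B)
console.log(17 * 0x10000);      // 1114112 code points in total
```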
UTF-16 Surrogate Pairs
Characters outside the BMP, e.g. U+1D306 TETRAGRAM FOR CENTRE (𝌆), can only be encoded in UTF-16 using two 16-bit code units: 0xD834 0xDF06. This is called a surrogate pair. Note that a surrogate pair only represents a single character.
The first code unit of a surrogate pair is always in the range from 0xD800 to 0xDBFF, and is called a high surrogate or a lead surrogate.
The second code unit of a surrogate pair is always in the range from 0xDC00 to 0xDFFF, and is called a low surrogate or a trail surrogate.
Surrogate pair: A representation for a single abstract character that consists of a
sequence of two 16-bit code units, where the first value of the pair is a high-surrogate
code unit and the second value is a low-surrogate code unit. Surrogate pairs are used only in UTF-16.
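JavaScript strings are sequences of UTF-16 code units, so the ranges above can be observed directly. The sketch below is illustrative only; `isHighSurrogate` and `isLowSurrogate` are hypothetical helper names.

```javascript
// Range checks taken from the definitions above.
const isHighSurrogate = (unit) => unit >= 0xD800 && unit <= 0xDBFF;
const isLowSurrogate = (unit) => unit >= 0xDC00 && unit <= 0xDFFF;

const tetragram = '\uD834\uDF06'; // U+1D306 written as a surrogate pair

console.log(tetragram.length);                          // 2 (code units)
console.log(isHighSurrogate(tetragram.charCodeAt(0)));  // true
console.log(isLowSurrogate(tetragram.charCodeAt(1)));   // true
console.log(tetragram.codePointAt(0).toString(16));     // "1d306"
console.log([...tetragram].length);                     // 1 (code point)
```

Iterating with the spread operator or `for...of` walks code points rather than code units, which is usually what you want when counting characters.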
Calculating Surrogate Pairs
The Unicode character 💩 Pile of Poo (U+1F4A9) in UTF-16 must be encoded as a surrogate pair, i.e. two surrogates. To convert any code point to a surrogate pair, use the following algorithm (in JavaScript). Keep in mind that we’re using hexadecimal notation.
// Both functions expect a supplementary code point (0x10000 to 0x10FFFF).
// High surrogate: the top 10 bits of the 20-bit offset, added to 0xD800.
var High_Surrogate = function(Code_Point){ return Math.floor((Code_Point - 0x10000) / 0x400) + 0xD800 };
// Low surrogate: the bottom 10 bits of the offset, added to 0xDC00.
var Low_Surrogate = function(Code_Point){ return (Code_Point - 0x10000) % 0x400 + 0xDC00 };
// Reverses The Conversion
var Code_Point = function(High_Surrogate, Low_Surrogate){
  return (High_Surrogate - 0xD800) * 0x400 + Low_Surrogate - 0xDC00 + 0x10000;
};
> var codepoint = 0x1F4A9; // 0x1F4A9 == 128169
> High_Surrogate(codepoint).toString(16)
"d83d" // 0xD83D == 55357
> Low_Surrogate(codepoint).toString(16)
"dca9" // 0xDCA9 == 56489
> String.fromCharCode( High_Surrogate(codepoint) , Low_Surrogate(codepoint) );
"💩"
> String.fromCodePoint(0x1F4A9)
"💩"
> '\ud83d\udca9'
"💩"
Composing & Decomposing
Unicode includes a mechanism for modifying character shape that greatly extends the supported glyph repertoire. This covers the use of combining diacritical marks. They are inserted after the main character. Multiple combining diacritics may be stacked over the same character. Unicode also contains precomposed versions of most letter/diacritic combinations in normal use.
Certain sequences of characters can also be represented as a single character, called a precomposed character (or composite or decomposable character). For example, the character “ü” can be encoded as the single code point U+00FC “ü” or as the base character U+0075 “u” followed by the non-spacing character U+0308 “¨”. The Unicode Standard encodes precomposed characters for compatibility with established standards such as Latin 1, which includes many precomposed characters such as “ü” and “ñ”.
Precomposed characters may be decomposed for consistency or analysis. For example, in alphabetizing (collating) a list of names, the character “ü” may be decomposed into a “u” followed by the non-spacing character “¨”. Once the character has been decomposed, it may be easier for the collation to work with the character because it can be processed as a “u” with modifications. This allows easier alphabetical sorting for languages where character modifiers do not affect alphabetical order. The Unicode Standard defines the decompositions for all precomposed characters. It also defines normalization forms to provide for unique representations of characters.
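In JavaScript, the normalization forms are exposed through `String.prototype.normalize`. The sketch below shows the “ü” example from the text; `stripDiacritics` is a hypothetical helper that assumes an engine with Unicode property escapes in regular expressions (ES2018 or later), and it is only a rough approximation of what a real collation library does.

```javascript
const precomposed = '\u00FC'; // "ü" as a single code point
const decomposed = 'u\u0308'; // "u" followed by COMBINING DIAERESIS

// The two spellings look identical but compare unequal as raw strings.
console.log(precomposed === decomposed);                  // false

// NFD decomposes, NFC recomposes.
console.log(precomposed.normalize('NFD') === decomposed); // true
console.log(decomposed.normalize('NFC') === precomposed); // true

// Hypothetical helper: decompose, then drop every combining mark.
function stripDiacritics(text) {
  return text.normalize('NFD').replace(/\p{M}/gu, '');
}

console.log(stripDiacritics('M\u00FCller')); // "Muller"
```

Normalizing both sides to the same form before comparing or storing text avoids treating visually identical strings as different.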
Myths of Unicode
From Mark Davis’s Unicode Myths slides.
- Unicode is simply a 16-bit code – Some people are under the misconception that Unicode is simply a 16-bit code where each character takes 16 bits and therefore there are 65,536 possible characters. This is not, actually, correct. It is the single most common myth about Unicode, so if you thought that, don’t feel bad.
- You can use any unassigned codepoint for internal use – No. Eventually that hole will be filled with a different character. Instead use private use or noncharacters.
- Every Unicode code point represents a character – No. There are lots of noncharacters (FFFE, FFFF, 1FFFE, …). There are also surrogate code points, private and unassigned codepoints, and control/format “characters” (RLM, ZWNJ, …).
- Unicode will run out of space – If it were linear, we would run out in 2140 AD. But it isn’t linear. See https://www.unicode.org/roadmaps/
- Case mappings are 1-1 – No. They can also be:
  - One-to-many: (ß → SS )
  - Contextual: (…Σ ↔ …ς AND …ΣΤ… ↔ …στ… )
  - Locale-sensitive: ( I ↔ ı AND İ ↔ i )
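The case-mapping myth is easy to check in JavaScript. The following is a small sketch rather than a full treatment; `equalsIgnoringCase` is a hypothetical helper, and the locale-sensitive line depends on the locale data shipped with the engine, so no output is claimed for it.

```javascript
// One-to-many: a single lowercase letter becomes two uppercase letters.
const sharpS = '\u00DF'; // ß
const upperSharpS = sharpS.toUpperCase();
console.log(upperSharpS);                       // "SS"
console.log(sharpS.length, upperSharpS.length); // 1 2

// Hypothetical helper: a naive case-insensitive comparison built on
// toUpperCase. Because of the one-to-many mapping, "straße" matches "STRASSE".
function equalsIgnoringCase(a, b) {
  return a.toUpperCase() === b.toUpperCase();
}
console.log(equalsIgnoringCase('stra\u00DFe', 'STRASSE')); // true

// Locale-sensitive: pass an explicit locale tag. With Turkish locale data
// available, "I" is expected to lowercase to the dotless "ı" instead of "i".
console.log('I'.toLocaleLowerCase('tr'));
```

A practical consequence is that changing case can change the length of a string, so buffers and offsets computed before the conversion may no longer line up afterwards.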
Applied Unicode Encodings
Encoding Type | Raw Encoding |
---|---|
HTML Entity (Decimal) | `&#128406;` |
HTML Entity (Hexadecimal) | `&#x1F596;` |
URL Escape Code | `%F0%9F%96%96` |
UTF-8 (hex) | `0xF0 0x9F 0x96 0x96 (f09f9696)` |
UTF-8 (binary) | `11110000:10011111:10010110:10010110` |
UTF-16/UTF-16BE (hex) | `0xD83D 0xDD96 (d83ddd96)` |
UTF-16LE (hex) | `0x3DD8 0x96DD (3dd896dd)` |
UTF-32/UTF-32BE (hex) | `0x0001F596 (0001f596)` |
UTF-32LE (hex) | `0x96F50100 (96f50100)` |
Octal Escape Sequence | `\360\237\226\226` |
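Most rows of the table above can be derived programmatically from the code point U+1F596. The sketch below assumes a runtime with the global `TextEncoder`; `describeEncodings` and its property names are hypothetical, chosen only to mirror the table rows.

```javascript
// Derives several textual encodings of one code point.
function describeEncodings(codePoint) {
  const ch = String.fromCodePoint(codePoint);
  const hex = (n, width) => n.toString(16).toUpperCase().padStart(width, '0');
  const utf8 = Array.from(new TextEncoder().encode(ch));       // UTF-8 bytes
  const utf16 = ch.split('').map((unit) => unit.charCodeAt(0)); // UTF-16 code units

  return {
    htmlDecimal: '&#' + codePoint + ';',
    htmlHex: '&#x' + hex(codePoint, 0) + ';',
    urlEscape: encodeURIComponent(ch), // percent-encodes the UTF-8 bytes
    utf8Hex: utf8.map((b) => hex(b, 2)).join(' '),
    utf16Hex: utf16.map((u) => hex(u, 4)).join(' '),
    utf32Hex: hex(codePoint, 8),
    octalEscape: utf8.map((b) => '\\' + b.toString(8)).join(''),
  };
}

const vulcanSalute = describeEncodings(0x1F596);
console.log(vulcanSalute.htmlDecimal); // &#128406;
console.log(vulcanSalute.urlEscape);   // %F0%9F%96%96
console.log(vulcanSalute.utf8Hex);     // F0 9F 96 96
console.log(vulcanSalute.utf16Hex);    // D83D DD96
console.log(vulcanSalute.utf32Hex);    // 0001F596
console.log(vulcanSalute.octalEscape); // \360\237\226\226
```

The little-endian rows are the same code units with the bytes inside each unit reversed.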
Source Code
Encoding Type | Raw Encoding |
---|---|
JavaScript | `\u{1F596}` or `\uD83D\uDD96` |
JSON | `\uD83D\uDD96` |
C | `\U0001F596` |
C++ | `\U0001F596` |
Java | `\uD83D\uDD96` |
Python | `\U0001F596` |
Perl | `\x{1F596}` |
Ruby | `\u{1F596}` |
CSS | `\01F596` |