Encoding
UCS-2
Every character occupies exactly 2 bytes, so only code points U+0000 through U+FFFF can be represented.
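A minimal sketch of the fixed-width idea, assuming big-endian byte order (the notes do not say which order is meant). The helper name `ucs2_encode` is hypothetical, not a standard library function.

```python
def ucs2_encode(text: str) -> bytes:
    """Encode text as big-endian UCS-2: exactly 2 bytes per character.

    Hypothetical helper for illustration; byte order is an assumption.
    """
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        if cp > 0xFFFF:
            # UCS-2 has no way to express code points beyond 16 bits.
            raise ValueError(f"U+{cp:X} does not fit in 2 bytes")
        out += cp.to_bytes(2, "big")
    return bytes(out)


encoded = ucs2_encode("A資")
print(encoded.hex())  # 00418cc7
```

The output length is always twice the number of characters, which is what makes indexing by character position trivial in UCS-2.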
UTF-8
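A variable-length Unicode encoding. The number of leading 1 bits in the first byte gives the total byte length of the sequence, and every following byte starts with 10. The 5- and 6-byte forms in the table come from the original design; RFC 3629 limits UTF-8 to 4 bytes (code points up to U+10FFFF).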
| Byte Length | Byte 1 | Byte 2 | Byte 3 | Byte 4 | Byte 5 | Byte 6 |
|---|---|---|---|---|---|---|
| 1 (ASCII) | 0xxxxxxx | | | | | |
| 2 | 110xxxxx | 10xxxxxx | | | | |
| 3 | 1110xxxx | 10xxxxxx | 10xxxxxx | | | |
| 4 | 11110xxx | 10xxxxxx | 10xxxxxx | 10xxxxxx | | |
| 5 | 111110xx | 10xxxxxx | 10xxxxxx | 10xxxxxx | 10xxxxxx | |
| 6 | 1111110x | 10xxxxxx | 10xxxxxx | 10xxxxxx | 10xxxxxx | 10xxxxxx |
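The table above can be turned into an encoder fairly directly. The following is a sketch rather than a reference implementation: the function name `utf8_encode` is hypothetical, it follows all six rows of the table (modern UTF-8 stops at 4 bytes), and it does not reject surrogate code points.

```python
def utf8_encode(cp: int) -> bytes:
    """Encode one code point using the bit patterns from the table.

    Hypothetical helper; covers the legacy 5- and 6-byte rows too.
    """
    if cp < 0:
        raise ValueError("code point must be non-negative")
    if cp < 0x80:
        return bytes([cp])  # 1 byte: 0xxxxxxx
    # (continuation byte count, first-byte prefix, largest code point that fits)
    rows = (
        (1, 0xC0, 0x7FF),        # 110xxxxx + 1 continuation byte
        (2, 0xE0, 0xFFFF),       # 1110xxxx + 2
        (3, 0xF0, 0x1FFFFF),     # 11110xxx + 3
        (4, 0xF8, 0x3FFFFFF),    # 111110xx + 4 (legacy)
        (5, 0xFC, 0x7FFFFFFF),   # 1111110x + 5 (legacy)
    )
    for count, prefix, limit in rows:
        if cp <= limit:
            tail = []
            for _ in range(count):
                tail.append(0x80 | (cp & 0x3F))  # 10xxxxxx, low 6 bits
                cp >>= 6
            # What remains of cp fits in the free bits of the first byte.
            return bytes([prefix | cp] + tail[::-1])
    raise ValueError("code point needs more than 31 bits")


print(utf8_encode(0x8CC7).hex())  # e8b387
```

For code points that Python itself accepts, the result should match `chr(cp).encode("utf-8")`.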
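Example: decoding the 3-byte sequence E8 B3 87
E8B387 = 11101000 10110011 10000111
Payload bits (after dropping the 1110 and 10 prefixes) = 1000 110011 000111
Unicode = 1000 1100 1100 0111 = U+8CC7 (資)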
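The example above can be reproduced in code. This is a sketch under a few assumptions: the name `utf8_decode_one` is hypothetical, the input holds exactly one encoded character, and overlong encodings are not rejected (a strict validator would need that check as well).

```python
def utf8_decode_one(data: bytes) -> int:
    """Decode a single UTF-8 sequence into its code point.

    Hypothetical helper; does not detect overlong encodings.
    """
    if not data:
        raise ValueError("empty input")
    first = data[0]
    # Count the leading 1 bits of the first byte.
    ones = 0
    mask = 0x80
    while first & mask:
        ones += 1
        mask >>= 1
    if ones == 1 or ones > 6:
        # 10xxxxxx is a continuation byte; 0xFE and 0xFF are never valid.
        raise ValueError("invalid first byte")
    length = 1 if ones == 0 else ones
    if len(data) != length:
        raise ValueError("wrong number of bytes for this first byte")
    # Bits below the terminating 0 of the prefix are payload.
    cp = first & (mask - 1)
    for b in data[1:]:
        if b & 0xC0 != 0x80:
            raise ValueError("continuation byte must look like 10xxxxxx")
        cp = (cp << 6) | (b & 0x3F)
    return cp


print(hex(utf8_decode_one(bytes.fromhex("E8B387"))))  # 0x8cc7
```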
https://leetcode.com/problems/utf-8-validation/