UTF8 is a Unicode standard encoding that represents each code point as a sequence of one to four 8-bit bytes. UTF8 can represent every Unicode code point, which covers the characters of virtually all writing systems in use. It is compatible with ASCII by design: the 128 ASCII characters are encoded as single bytes with the same binary values they have in ASCII. ASCII uses 7 bits, and UTF8 stores those same values in an 8-bit byte whose leading bit is 0, so ASCII is a subset of UTF8.
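As a quick illustration of the one-to-four-byte range and the ASCII overlap, here is a minimal Python sketch. The sample characters and the helper name `utf8_hex` are chosen for this example only; the encoding itself is done by Python's built-in `str.encode`.

```python
def utf8_hex(ch: str) -> str:
    """Return the UTF8 bytes of ch as space-separated uppercase hex."""
    return " ".join(f"{b:02X}" for b in ch.encode("utf-8"))

# One sample character from each UTF8 length class (illustrative choices).
samples = ["A", "é", "₹", "😀"]

for ch in samples:
    byte_count = len(ch.encode("utf-8"))
    print(f"U+{ord(ch):04X} -> {utf8_hex(ch)} ({byte_count} byte(s))")

# Pure ASCII text yields identical bytes under ASCII and UTF8.
print("Hello".encode("ascii") == "Hello".encode("utf-8"))
```

The letter "A" stays a single byte (41), while "😀" needs four bytes (F0 9F 98 80).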
Today, UTF8 is the most popular character encoding on the web. Most people never notice it, because the browser has already decoded the bytes into readable characters, which matters especially for non-English text.
Pros of UTF8 encoding
- UTF8 supports many languages.
- Most programming languages support UTF8.
- UTF8 is compatible with ASCII.
- UTF8 can be converted to other charsets easily with iconv.
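The list above mentions iconv for charset conversion. As a rough analogue, and not iconv itself, the sketch below uses Python's built-in codecs to decode UTF8 bytes and re-encode them in another charset. The function name `transcode` and the sample text are assumptions made for this example.

```python
def transcode(data: bytes, source: str, target: str, errors: str = "strict") -> bytes:
    """Decode data from the source charset and re-encode it in the target charset."""
    return data.decode(source).encode(target, errors=errors)

utf8_bytes = "é₹".encode("utf-8")

# UTF-16 (little endian) can represent both characters.
print(transcode(utf8_bytes, "utf-8", "utf-16-le").hex(" "))

# Latin-1 has no rupee sign, so errors="replace" substitutes "?" for it.
print(transcode(utf8_bytes, "utf-8", "latin-1", errors="replace"))
```

With the default `errors="strict"`, the Latin-1 conversion would raise `UnicodeEncodeError` instead, because the target charset cannot represent every character in the input.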
Cons of UTF8 encoding
- UTF8 is a variable-length encoding, and higher code points need more bytes, so the number of bytes in a string cannot be determined from its character count alone.
- Some programming languages require an encoding module to handle UTF8.
- UTF8 takes more processing time to locate a given character, because with a variable-length encoding the byte sequence has to be scanned to find where each character starts.
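To make the variable-length point concrete, the following Python sketch compares character count with byte count and shows how the first byte of a sequence reveals its length. The helper names `sequence_length` and `char_offsets` are hypothetical, and the code assumes well-formed UTF8 input.

```python
def sequence_length(lead_byte: int) -> int:
    """Return the length in bytes of a UTF8 sequence, given its first byte."""
    if lead_byte < 0x80:            # 0xxxxxxx: single byte (ASCII)
        return 1
    if lead_byte >> 5 == 0b110:     # 110xxxxx: 2-byte sequence
        return 2
    if lead_byte >> 4 == 0b1110:    # 1110xxxx: 3-byte sequence
        return 3
    if lead_byte >> 3 == 0b11110:   # 11110xxx: 4-byte sequence
        return 4
    raise ValueError(f"0x{lead_byte:02X} is not a valid UTF8 lead byte")

def char_offsets(data: bytes) -> list[int]:
    """Scan well-formed UTF8 data and return the byte offset of each character."""
    offsets = []
    i = 0
    while i < len(data):
        offsets.append(i)
        i += sequence_length(data[i])
    return offsets

text = "a₹😀"
data = text.encode("utf-8")
print(len(text), len(data))   # character count vs. byte count
print(char_offsets(data))     # where each character starts
```

Finding the third character requires walking past the first two, which is the scanning cost the last bullet refers to.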
Example – Encode string “₹” to UTF8 hexadecimal. (UTF8 Encode)
1. Search for the code point of “₹” (the rupee sign), which is “U+20B9”.
2. Convert “20B9” hexadecimal to binary numbers
"20B9" = "0010 0000 1011 1001"
3. Refer to the table UTF8 Code Point Prefix. A 16-bit code point needs the 3-byte format below.
Code Point 16 Bits = "1110(XXXX) 10(XXXXXX) 10(XXXXXX)"
Starting from the left-hand side, regroup the 16 bits into groups of 4, 6, and 6 bits to match the UTF8 encoding format.
Rearrange: 0010 0000 1011 1001 -> 0010 000010 111001
Put the prefix bits in front of each group to form the three bytes.
UTF8 Prefix: "1110(0010) 10(000010) 10(111001)"
4. Now you have 3 bytes of UTF8 binary. Convert each byte back to hexadecimal.
"11100010 10000010 10111001" = "E2 82 B9"
The result of “₹” UTF8 encoding will be:
Hexadecimal : E2 82 B9
Hex notation : \xE2\x82\xB9
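The encoding steps above can be sketched in Python as follows. This is a hand-rolled illustration of the bit rearrangement, not how any particular library implements it; the function name `encode_code_point` is an assumption for this example, and in practice `str.encode("utf-8")` does the same job.

```python
def encode_code_point(cp: int) -> bytes:
    """Encode one Unicode code point as UTF8 by adding the prefix bits by hand."""
    if cp < 0 or cp > 0x10FFFF:
        raise ValueError("code point out of Unicode range")
    if 0xD800 <= cp <= 0xDFFF:
        raise ValueError("surrogate code points cannot be encoded in UTF8")
    if cp <= 0x7F:
        # 0XXXXXXX
        return bytes([cp])
    if cp <= 0x7FF:
        # 110XXXXX 10XXXXXX
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp <= 0xFFFF:
        # 1110XXXX 10XXXXXX 10XXXXXX
        return bytes([
            0xE0 | (cp >> 12),
            0x80 | ((cp >> 6) & 0x3F),
            0x80 | (cp & 0x3F),
        ])
    # 11110XXX 10XXXXXX 10XXXXXX 10XXXXXX
    return bytes([
        0xF0 | (cp >> 18),
        0x80 | ((cp >> 12) & 0x3F),
        0x80 | ((cp >> 6) & 0x3F),
        0x80 | (cp & 0x3F),
    ])

encoded = encode_code_point(0x20B9)          # the rupee sign from the example
print(" ".join(f"{b:02X}" for b in encoded))  # E2 82 B9
print(encoded == "₹".encode("utf-8"))         # True
```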
Example – Decode UTF8 hexadecimal “E2 82 B9” back to a string. (UTF8 Decode)
1. Convert all hexadecimal values to binary bits.
2. Read the binary bits and determine the starter prefix of each byte, as shown in the table UTF8 Code Point Prefix.
3. Remove the prefix bits and join the remaining bits to recover the Unicode code point.
4. Map the code point back to a character.
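The decoding steps above can be sketched in Python in the same way. The function name `decode_first_code_point` is hypothetical, and the sketch only checks prefixes and length; it does not reject overlong or otherwise invalid sequences the way a full decoder such as `bytes.decode("utf-8")` does.

```python
def decode_first_code_point(data: bytes) -> int:
    """Decode the first UTF8 sequence in data and return its code point."""
    if not data:
        raise ValueError("no bytes to decode")
    lead = data[0]
    if lead < 0x80:                  # 0XXXXXXX: single byte
        return lead
    if lead >> 5 == 0b110:           # 110XXXXX: keep the low 5 bits
        length, cp = 2, lead & 0x1F
    elif lead >> 4 == 0b1110:        # 1110XXXX: keep the low 4 bits
        length, cp = 3, lead & 0x0F
    elif lead >> 3 == 0b11110:       # 11110XXX: keep the low 3 bits
        length, cp = 4, lead & 0x07
    else:
        raise ValueError("invalid UTF8 lead byte")
    if len(data) < length:
        raise ValueError("truncated UTF8 sequence")
    for b in data[1:length]:
        if b >> 6 != 0b10:           # continuation bytes must start with 10
            raise ValueError("invalid UTF8 continuation byte")
        cp = (cp << 6) | (b & 0x3F)  # drop the prefix, append the 6 payload bits
    return cp

data = bytes.fromhex("E2 82 B9")
code_point = decode_first_code_point(data)
print(f"U+{code_point:04X}")  # U+20B9
print(chr(code_point))        # ₹
```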