UTF16 encoder/decoder – Online converter tool to encode/decode strings to UTF16 and vice versa, with an interactive UTF16 encoding algorithm, by ConvertCodes.
Remark: The UTF16 encode/decode input box is limited to 10,000 characters. For larger data, please convert by uploading a file.
UTF16 is a Unicode standard encoding that represents each code point as one or two 16-bit code units, whereas UTF8 uses one to four 8-bit bytes. Several platforms and programming languages use UTF16 internally, such as the Windows OS, Java (Oracle), and JavaScript.
The ranges U+0000 to U+D7FF and U+E000 to U+FFFF together form the Basic Multilingual Plane (BMP). That means UTF16 can store most common characters in a single 16-bit code unit. Some Asian, Middle Eastern, and African characters, along with emoji and emoticons, fall into the Supplementary Planes (U+10000 to U+10FFFF). These code points require two 16-bit code units (a surrogate pair) in UTF16 because their values are at least 65,536 (2^16).
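As a rough illustration, here is a minimal Python sketch (using the standard utf-16-be codec, not this site's converter) that encodes two BMP characters and one supplementary-plane character and prints how many 16-bit code units each needs:

```python
# BMP characters fit in one 16-bit code unit; supplementary-plane
# characters (such as this emoji) need a surrogate pair, i.e. two
# code units / four bytes.
for ch in ("A", "€", "😀"):               # U+0041, U+20AC (both BMP), U+1F600 (supplementary)
    encoded = ch.encode("utf-16-be")      # big-endian UTF-16, no BOM
    units = len(encoded) // 2
    print(f"U+{ord(ch):04X} -> {encoded.hex()} ({units} code unit(s))")
```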
Pros of UTF16 encoding
- All Basic Multilingual Plane (BMP) characters, which cover the languages in common use, can be represented in 2 bytes (one code unit) in UTF16.
- UTF16 can give good programming performance because its fixed-width code units make indexing and code point calculation faster than in UTF8 (see the sketch after this list).
- Even though UTF16 can use pairs of 16-bit code units, the characters in common use still fall in the first half of the range (the BMP), which needs only one 16-bit code unit to calculate and convert.
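To illustrate the indexing point, here is a minimal sketch assuming BMP-only text (the sample string is just an example): because UTF16 code units are fixed-width, the i-th BMP character sits at byte offset 2 × i, while UTF8 would have to walk variable-length sequences to find it.

```python
# For BMP-only text, the i-th character's code unit starts at byte 2*i,
# so the lookup is simple offset arithmetic.
text = "Grüße"                            # BMP-only sample string (assumption)
utf16 = text.encode("utf-16-be")
i = 2                                     # look up the 3rd character
unit = int.from_bytes(utf16[2 * i : 2 * i + 2], "big")
print(hex(unit), chr(unit))               # 0xfc ü
```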
Cons of UTF16 encoding
- In practice, much encoded data is plain ASCII, i.e. only the first 128 characters of Unicode. For such data, the UTF16 output contains many null bytes, which wastes memory (see the comparison after this list).
- Requires more bandwidth to transmit over the network, and because of the embedded null bytes it cannot be handled as a null-terminated byte string.
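A minimal sketch of the memory overhead for ASCII-only data (the sample string is just an assumption for illustration):

```python
# Pure-ASCII text doubles in size under UTF-16, because every code unit
# carries a zero high byte.
text = "Hello, world!" * 100              # ASCII-only sample data (assumption)
print(len(text.encode("utf-8")))          # 1300 bytes
print(len(text.encode("utf-16-be")))      # 2600 bytes, half of them 0x00
```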
Example – Encode string “𤭢” to UTF16 hexadecimal. (UTF16 Encode)
1. Search for the code point of “𤭢”, which is “U+24B62”.
2. Because the code point is above U+10000, UTF16 needs a surrogate pair (two 16-bit code units). Subtract 0x10000 from the code point; the remainder is “14B62”.
3. Convert “14B62” to binary.
| Hexadecimal | Binary |
|---|---|
| 1 | 0001 |
| 4 | 0100 |
| B | 1011 |
| 6 | 0110 |
| 2 | 0010 |

"14B62" = 0001 0100 1011 0110 0010
4. Split the 20 binary bits into two parts:
- The first 10 bits are the high surrogate.
- The last 10 bits are the low surrogate.
0001 0100 1011 0110 0010
High : 0001010010
Low : 1101100010
5. Add “0xD800” to the high surrogate.
D800 : 1101 1000 0000 0000
High : 0000 0000 0101 0010
= 1101 1000 0101 0010
= D852
The 16-bit high surrogate encoding will be “0xD852”.
6. Add “0xDC00” to the low surrogate.
DC00 : 1101 1100 0000 0000
Low : 0000 0011 0110 0010
= 1101 1111 0110 0010
= DF62
The 16-bit low surrogate encoding will be “0xDF62”.
7. The result: “𤭢” (U+24B62) encodes to 0xD852 0xDF62, or \uD852\uDF62.
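The same calculation can be written as a short function. This is only a sketch of the steps above (the function name is ours, not part of any library):

```python
def utf16_encode(code_point: int) -> list[int]:
    """Return the UTF-16 code units for one Unicode code point."""
    if code_point < 0x10000:                  # BMP: a single code unit
        return [code_point]
    offset = code_point - 0x10000             # step 2: subtract 0x10000
    high = 0xD800 + (offset >> 10)            # step 5: top 10 bits + 0xD800
    low = 0xDC00 + (offset & 0x3FF)           # step 6: low 10 bits + 0xDC00
    return [high, low]

print([hex(u) for u in utf16_encode(0x24B62)])   # ['0xd852', '0xdf62']
```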
Example – Decode UTF16 hexadecimal \uD852\uDF62 (UTF16 Decode)
1. Convert the UTF16 hexadecimal code units to binary.
D852 : 1101 1000 0101 0010
DF62 : 1101 1111 0110 0010
2. Subtract 0xD800 from the high surrogate (0xD852).
D852 : 1101 1000 0101 0010
D800 : 1101 1000 0000 0000
= 0000 0000 0101 0010
= 52
3. Multiply the result by 0x0400: 0x52 × 0x400 = 0x14800.
4. Subtract 0xDC00 from the low surrogate (0xDF62).
DF62 : 1101 1111 0110 0010
DC00 : 1101 1100 0000 0000
= 0000 0011 0110 0010
= 362
The result is “0x362”
5. Add the two values above (0x14800 + 0x362), then add 0x10000 to get the final code point.
0x14800 : 0001 0100 1000 0000 0000
0x362 : 0000 0000 0011 0110 0010
0x10000 : 0001 0000 0000 0000 0000
= 0010 0100 1011 0110 0010
= 24B62
The result of decoding UTF16 “\uD852\uDF62” is U+24B62
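The decoding steps collapse into one formula: code point = 0x10000 + (high − 0xD800) × 0x400 + (low − 0xDC00). A minimal sketch, again with a function name of our own choosing:

```python
def utf16_decode_pair(high: int, low: int) -> int:
    """Combine a UTF-16 surrogate pair back into a code point."""
    return 0x10000 + (high - 0xD800) * 0x400 + (low - 0xDC00)

cp = utf16_decode_pair(0xD852, 0xDF62)
print(hex(cp), chr(cp))                   # 0x24b62 𤭢
```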