UTF16 Encode Decode – Convert String to UTF16

UTF16 encoder/decoder – online converter tools by ConvertCodes to encode/decode strings to UTF16 and vice versa, with an interactive walkthrough of the UTF16 encoding algorithm.


UTF16 is a Unicode standard encoding that represents each code point as one or two 16-bit code units, in contrast to UTF8, which uses one to four 8-bit bytes. Several platforms and languages use UTF16 internally, such as the Windows OS, Java (Oracle), and JavaScript.

The ranges U+0000 to U+D7FF and U+E000 to U+FFFF together form the Basic Multilingual Plane (BMP). That means UTF16 can store most basic characters in a single 16-bit code unit. Some Asian, Middle-Eastern, and African characters fall into the Supplementary Planes (U+10000 to U+10FFFF). Code points in this range require two 16-bit code units (a surrogate pair) in UTF16, because they are 65,536 (2^16) or greater. Emoji and emoticons are the most familiar examples.
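Because JavaScript strings are sequences of UTF16 code units, this split between the BMP and the Supplementary Planes is easy to observe. A minimal TypeScript sketch (the sample characters are our own choice):

    const bmpChar = "é";   // U+00E9, inside the BMP: one 16-bit code unit
    const suppChar = "𤭢"; // U+24B62, Supplementary Plane: a surrogate pair

    console.log(bmpChar.length);  // 1 code unit
    console.log(suppChar.length); // 2 code units (high + low surrogate)
    console.log(suppChar.codePointAt(0)?.toString(16)); // "24b62"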

Pros of UTF16 encoding

  1. All Basic Multilingual Plane (BMP) characters, which cover most languages in common use, can be represented in a single 16-bit code unit (2 bytes) in UTF16.
  2. UTF16 gives good programming performance: for BMP-only text every character is a fixed-width 16-bit unit, so code-point indexing is faster to calculate than in UTF8 with its variable 1–4 byte sequences (see the short illustration after this list).
  3. Even though UTF16 can use pairs of 16-bit units, the characters in common use fall in the first half of the range and need only one 16-bit unit to calculate and convert.
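For BMP-only text, indexing by code unit is the same as indexing by character. A small TypeScript illustration (the sample string is our own):

    const s = "Hello"; // BMP-only: one 16-bit code unit per character
    // charCodeAt reads the nth 16-bit unit directly – a constant-time lookup.
    console.log(s.charCodeAt(1).toString(16)); // "65", i.e. "e"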

Cons of UTF16 encoding

  1. In real-world data, much text uses only the ASCII range (the first 128 code points). UTF16 encodes these characters with a null byte in every code unit, so the encoded data wastes a lot of memory (see the sketch after this list).
  2. It requires more bandwidth to transmit over the network, and because of the embedded null bytes a UTF16 string cannot be treated as a null-terminated string.
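The null-byte overhead is easy to see in a runtime that can encode UTF16 directly. A short sketch, assuming Node.js and its Buffer API (the sample string is our own):

    const text = "Hello"; // plain ASCII
    const utf8 = Buffer.from(text, "utf8");     // 5 bytes: 48 65 6c 6c 6f
    const utf16 = Buffer.from(text, "utf16le"); // 10 bytes, half of them null

    console.log(utf8.length, utf16.length); // 5 10
    console.log(utf16.toString("hex"));     // "480065006c006c006f00"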

Example – Encode string “𤭢” to UTF16 hexadecimal. (UTF16 Encode)

  1. Look up the code point for “𤭢”, which is U+24B62.
  2. Because the code point is at or above U+10000, UTF16 requires two 16-bit code units. Subtract 0x10000 from the code point; the remainder is 0x14B62.
  3. Convert “14B62” to binary, one hexadecimal digit at a time.
Hexadecimal : Binary
1 : 0001
4 : 0100
B : 1011
6 : 0110
2 : 0010
“14B62” = 0001 0100 1011 0110 0010

  4. Split the 20 binary bits into two parts:

  • The first 10 bits are the high surrogate
  • The last 10 bits are the low surrogate

0001 0100 1011 0110 0010
High : 0001010010
Low : 1101100010

  5. Add 0xD800 to the high surrogate.

D800 : 1101 1000 0000 0000
High : 0000 0000 0101 0010
= 1101 1000 0101 0010
= D852

The high surrogate 16-bit code unit is 0xD852.

  6. Add 0xDC00 to the low surrogate.

DC00 : 1101 1100 0000 0000
Low : 0000 0011 0110 0010
= 1101 1111 0110 0010
= DF62

The low surrogate 16-bit code unit is 0xDF62.

  7. The result: “𤭢” (U+24B62) encodes in UTF16 as 0xD852 0xDF62, written \uD852\uDF62.
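The whole encode procedure fits in a few lines of TypeScript. A minimal sketch of the steps above (the function name is our own, not a standard API):

    // Map a supplementary-plane code point (>= 0x10000) to a UTF16 surrogate pair.
    function toSurrogatePair(codePoint: number): [number, number] {
      const offset = codePoint - 0x10000;     // step 2: subtract 0x10000
      const high = 0xd800 + (offset >> 10);   // step 5: top 10 bits + 0xD800
      const low = 0xdc00 + (offset & 0x3ff);  // step 6: bottom 10 bits + 0xDC00
      return [high, low];
    }

    const [hi, lo] = toSurrogatePair(0x24b62);     // "𤭢"
    console.log(hi.toString(16), lo.toString(16)); // "d852" "df62"
    console.log(String.fromCharCode(hi, lo));      // "𤭢"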

Example – Decode UTF16 hexadecimal \uD852\uDF62 (UTF16 Decode)

  1. Convert UTF16 hexadecimal to binary.
D852 : 1101 1000 0101 0010
DF62 : 1101 1111 0110 0010

  2. Subtract 0xD800 from the high surrogate (0xD852):

D852 : 1101 1000 0101 0010
D800 : 1101 1000 0000 0000
= 0000 0000 0101 0010
= 52

Then multiply the result by 0x0400 (a left shift by 10 bits), which gives “0x14800”.

  3. Subtract 0xDC00 from the low surrogate (0xDF62):

DF62 : 1101 1111 0110 0010
DC00 : 1101 1100 0000 0000
= 0000 0011 0110 0010
= 362

The result is “0x362”

  4. Add the two values above (0x14800 + 0x362), then add 0x10000 to get the final code point:

0x14800 : 0001 0100 1000 0000 0000
0x362 : 0000 0000 0011 0110 0010
0x10000 : 0001 0000 0000 0000 0000
= 0010 0100 1011 0110 0010
= 24B62

The result of decoding UTF16 “\uD852\uDF62” is U+24B62
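Decoding is the same arithmetic in reverse. A minimal TypeScript sketch of the steps above (the function name is our own):

    // Combine a UTF16 surrogate pair back into a Unicode code point.
    function fromSurrogatePair(high: number, low: number): number {
      const highBits = (high - 0xd800) * 0x400; // step 2: strip 0xD800, multiply by 0x400
      const lowBits = low - 0xdc00;             // step 3: strip 0xDC00
      return highBits + lowBits + 0x10000;      // step 4: re-add the 0x10000 offset
    }

    const cp = fromSurrogatePair(0xd852, 0xdf62);
    console.log(cp.toString(16));          // "24b62"
    console.log(String.fromCodePoint(cp)); // "𤭢"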
