UTF: the encoding systems based on Unicode

UTF is short for “Unicode Transformation Format.” There are four different UTF formats that can be used to encode Unicode characters.

Most word processors, web pages and software programs use one of the four UTFs, as they cover a far wider selection of languages than older, language-specific encoding systems.

The following are the four types. UTF-7 uses (as the name implies) only 7-bit values in its output. It is a variable-length encoding system, and its whole purpose was to represent Unicode text in email messages, whose headers forbid byte values above the ASCII range. This allowed the encoding of non-ASCII characters even in these restrictive conditions.

UTF-8 is the most common Unicode encoding. It is used on virtually all common platforms for web pages and software programs.
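As a rough illustration of the difference between UTF-7 and UTF-8, the sketch below encodes the same text both ways using Python's built-in codecs. The sample string and the variable names are chosen here for illustration only.

```python
# Sample text containing two non-ASCII characters (illustrative choice).
text = "naïve café"

utf7_bytes = text.encode("utf-7")
utf8_bytes = text.encode("utf-8")

# UTF-7 output stays entirely within the 7-bit ASCII range.
utf7_is_ascii = all(b < 0x80 for b in utf7_bytes)

# UTF-8 uses byte values of 0x80 and above for non-ASCII characters.
utf8_is_ascii = all(b < 0x80 for b in utf8_bytes)

print(utf7_bytes)
print(utf8_bytes)
print(utf7_is_ascii, utf8_is_ascii)

# Both encodings decode back to the original text.
utf7_round_trip = utf7_bytes.decode("utf-7")
utf8_round_trip = utf8_bytes.decode("utf-8")
```

The UTF-7 bytes could pass through a channel that only accepts ASCII values, while the UTF-8 bytes could not.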

UTF-8 breaks characters into four groups, separated by how many bytes they are represented by: common English characters (the ASCII range) are represented by one byte; accented Latin letters, Hebrew and Arabic are each encoded with two bytes; most Asian characters, such as Chinese, Japanese and Korean, are encoded with three bytes; and all others are four bytes long.
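The four byte-length groups can be checked with a short sketch. The sample characters below are picked for illustration, one or two from each group, and the names `samples` and `byte_lengths` are not from any particular library.

```python
# One or two sample characters per UTF-8 length group (illustrative choices).
samples = ["A", "é", "א", "ع", "中", "😀"]

# Map each character to the number of bytes in its UTF-8 encoding.
byte_lengths = {ch: len(ch.encode("utf-8")) for ch in samples}

for ch, n in byte_lengths.items():
    print(f"U+{ord(ch):04X} {ch!r}: {n} byte(s)")
```

The Latin letter takes one byte, the accented, Hebrew and Arabic letters take two, the Chinese character takes three, and the emoji takes four.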

UTF-16 grew out of an early attempt at a universal encoding system in which every character took exactly 16 bits. That design nearly fell through after it became clear that the 2^16 possibilities weren’t enough to cover the number of characters, and the Unicode Consortium reps wouldn’t allow a 31-bit design to progress.

Thus UTF-16 was a compromise: characters that fit in 16 bits take a single 16-bit unit, and all others take a pair of 16-bit units (a surrogate pair), which makes it a variable-length encoding. It can represent the same characters as UTF-8, but unlike UTF-8 it is not compatible with ASCII and it depends on byte order.
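To see how UTF-16 copes with characters beyond the 2^16 limit, the following sketch encodes one character that fits in 16 bits and one that does not. It assumes big-endian byte order (`utf-16-be`) so that no byte order mark is added; the sample characters and variable names are illustrative.

```python
bmp_char = "中"      # U+4E2D, fits in a single 16-bit unit
astral_char = "😀"   # U+1F600, lies beyond the 16-bit range

bmp_bytes = bmp_char.encode("utf-16-be")
astral_bytes = astral_char.encode("utf-16-be")

# Split the 4-byte result into its two 16-bit units (a surrogate pair).
high = int.from_bytes(astral_bytes[:2], "big")
low = int.from_bytes(astral_bytes[2:], "big")

# Recombine the pair into the original code point.
code_point = 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)

print(len(bmp_bytes), len(astral_bytes))
print(hex(high), hex(low), hex(code_point))
```

The first character takes two bytes and the second takes four, so UTF-16 is a variable-length encoding in practice.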

Finally, UTF-32 is a coding system that encodes every character with 32 bits. This is the only fixed-length Unicode format.
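The fixed length is easy to confirm with a small sketch. It assumes big-endian byte order (`utf-32-be`) so that no byte order mark is added, and the sample characters are illustrative.

```python
samples = ["A", "é", "中", "😀"]

# Every character takes exactly 4 bytes (32 bits) in UTF-32.
utf32_lengths = [len(ch.encode("utf-32-be")) for ch in samples]

# The 4 bytes are simply the code point written as a 32-bit integer.
decoded_code_points = [
    int.from_bytes(ch.encode("utf-32-be"), "big") for ch in samples
]

print(utf32_lengths)
print([hex(cp) for cp in decoded_code_points])
```

The trade-off is size: plain English text takes four times as many bytes in UTF-32 as it does in UTF-8.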
