17.10.2023

How to work with Unicode in Python?

What is it?

Unicode serves as a character encoding standard encompassing virtually all written languages worldwide and is currently the dominant standard for the Internet.

At its core, Unicode is a systematic approach to representing text in a linear fashion. Each Unicode character is assigned a unique code, an integer ranging from 0 to 1,114,111. These codes play a crucial role in representing characters across computer systems, including operating systems, web browsers, and various applications.
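As a small illustration of the code range described above, the built-in functions ord() and chr() convert between a character and its integer code. This is only a sketch; the variable names are chosen for the example.

```python
import sys

# Look up the integer code of a character and convert it back.
letter = "A"
code_point = ord(letter)
print(code_point, hex(code_point))  # 65 0x41
print(chr(code_point))              # A

# The largest valid code is 0x10FFFF, which is 1,114,111 in decimal.
print(sys.maxunicode)               # 1114111

# Anything above that range is rejected by chr().
try:
    chr(sys.maxunicode + 1)
except ValueError as error:
    print("not a valid code:", error)
```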

Unicode has several advantages over older character encodings such as ASCII and ISO 8859-1. It supports a far larger range of characters, covering nearly all written languages in use today. Moreover, because its encodings can use more than one byte per character, it can represent large and intricate character sets, such as hieroglyphs.

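It is worth stressing that Unicode itself is not an encoding algorithm. It is a set of code points, something like a database of character identifiers. Separate algorithms build on that database: the encodings UTF-8, UTF-16 and UTF-32 turn code points into bytes, while others, such as UCA (collation) and BIDI (bidirectional text), define how text is sorted and displayed.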
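To make the difference between a code point and its encoded bytes concrete, here is a minimal sketch that encodes one character with three encodings. The sample character is an arbitrary choice, and the little-endian codec names are used so that no byte order mark is added to the output.

```python
# One character, three encodings: the code point stays the same,
# only the byte representation changes.
char = "\u00e9"  # the letter e with an acute accent, U+00E9
print(hex(ord(char)))                 # 0xe9

utf8_bytes = char.encode("utf-8")
utf16_bytes = char.encode("utf-16-le")
utf32_bytes = char.encode("utf-32-le")

print(utf8_bytes, len(utf8_bytes))    # b'\xc3\xa9' 2
print(utf16_bytes, len(utf16_bytes))  # b'\xe9\x00' 2
print(utf32_bytes, len(utf32_bytes))  # b'\xe9\x00\x00\x00' 4
```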

How to use it?

Python supports Unicode out of the box, and its standard library includes the UTF-8, UTF-16 and UTF-32 encodings. To convert data, we use the .encode and .decode methods:

print("That is data!".encode("utf-16"))
#Output b'\xff\xfeT\x00h\x00a\x00t\x00 \x00i\x00s\x00 \x00d\x00a\x00t\x00a\x00!\x00'

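In the output, each \x introduces one byte written in hexadecimal. The first two bytes, \xff\xfe, are the byte order mark, and every character after it takes up two bytes (16 bits), which is why each letter is followed by \x00. We can decode the bytes back into Latin letters with: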

print(b'\xff\xfeT\x00h\x00a\x00t\x00 \x00i\x00s\x00 \x00d\x00a\x00t\x00a\x00!\x00'.decode("utf-16"))
#Output That is data!
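The leading \xff\xfe in the encoded output is a byte order mark (BOM), which the plain "utf-16" codec adds in front of the text. The sketch below, using the short sample string "Hi" as an assumption for brevity, shows how to recognise it and how the explicit "utf-16-le" and "utf-16-be" codecs leave it out.

```python
import codecs

encoded = "Hi".encode("utf-16")

# The first two bytes are the byte order mark, not part of the text.
# Which of the two marks appears depends on the machine's native byte order.
bom = encoded[:2]
print(bom in (codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE))  # True

# Naming the byte order explicitly produces no byte order mark.
little_endian = "Hi".encode("utf-16-le")
big_endian = "Hi".encode("utf-16-be")
print(little_endian)  # b'H\x00i\x00'
print(big_endian)     # b'\x00H\x00i'
```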

As noted above, Unicode, and therefore UTF-16, also supports Arabic letters, hieroglyphs and most other writing systems:

print("العربية".encode("utf-16"))
#Output b"\xff\xfe'\x06D\x069\x061\x06(\x06J\x06)\x06"

Decoding works the same way as described above:

print(b"\xff\xfe'\x06D\x069\x061\x06(\x06J\x06)\x06".decode("utf-16"))
#Output العربية

You can also leave out the name of the encoding, in which case Python uses UTF-8 by default, for example:

print("Hi, my name is Jhon!".encode())
#Output b'Hi, my name is Jhon!'

Decoding works the same way:

print(b'Hi, my name is Jhon!'.decode())
#Output Hi, my name is Jhon!
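When no encoding is named, Python 3 falls back to UTF-8. The following sketch checks this. The second sample string is an assumption added for illustration, since pure ASCII text looks identical before and after UTF-8 encoding.

```python
text = "Hi, my name is Jhon!"

# str.encode() and bytes.decode() use UTF-8 when no encoding is given.
default_bytes = text.encode()
explicit_bytes = text.encode("utf-8")
print(default_bytes == explicit_bytes)  # True

# ASCII characters are stored as single, unchanged bytes in UTF-8.
# A non-ASCII character makes the encoding visible.
accented = "caf\u00e9".encode()
print(accented)           # b'caf\xc3\xa9'
print(accented.decode())  # prints the original word again
```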

Passing the encoded bytes to list() shows the value of each byte as a decimal integer:

print(list(b"\xff\xfe'\x06D\x069\x061\x06(\x06J\x06)\x06"))
#Output [255, 254, 39, 6, 68, 6, 57, 6, 49, 6, 40, 6, 74, 6, 41, 6]
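The conversion also works in the other direction. As a minimal sketch, bytes() rebuilds the original byte string from such a list of integers, which can then be decoded as usual. The variable names are chosen for the example.

```python
# The same UTF-16 bytes as in the example above.
data = b"\xff\xfe'\x06D\x069\x061\x06(\x06J\x06)\x06"

numbers = list(data)
print(numbers)  # [255, 254, 39, 6, 68, 6, 57, 6, 49, 6, 40, 6, 74, 6, 41, 6]

# bytes() accepts integers in the range 0-255 and rebuilds the byte string.
restored = bytes(numbers)
print(restored == data)            # True
print(restored.decode("utf-16"))   # prints the Arabic word from the example
```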

Such a list can be helpful in different scenarios in your programs and utilities. However, the encodings do not represent Unicode text in the same way, so bytes produced by one encoding generally cannot be read with another. Take the text message "Here’s my data!" encoded as UTF-16:

print("Here’s my data!".encode("utf-16"))
#Output b'\xff\xfeH\x00e\x00r\x00e\x00\x19 s\x00 \x00m\x00y\x00 \x00d\x00a\x00t\x00a\x00!\x00'

If we then try to decode these bytes as UTF-8 or UTF-32, we will see:

print(b'\xff\xfeH\x00e\x00r\x00e\x00\x19 s\x00 \x00m\x00y\x00 \x00d\x00a\x00t\x00a\x00!\x00'.decode("utf-8"))
#Output UnicodeDecodeError: 'utf-8' codec can't decode
print(b'\xff\xfeH\x00e\x00r\x00e\x00\x19 s\x00 \x00m\x00y\x00 \x00d\x00a\x00t\x00a\x00!\x00'.decode("utf-32"))
#Output UnicodeDecodeError: 'utf-32' codec can't decode

In both cases Python raises a UnicodeDecodeError, because the bytes are not valid data in those encodings.
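In a real program an unhandled UnicodeDecodeError stops execution, so it is usually caught. The sketch below is one possible way to do that, not the only one: it tries several encodings in turn and then shows the errors="replace" option, which substitutes undecodable bytes instead of raising. The results dictionary is a helper introduced only for this example.

```python
# "\u2019" is the curly apostrophe used in the message above.
message = "Here\u2019s my data!"
data = message.encode("utf-16")

results = {}
for encoding in ("utf-8", "utf-32", "utf-16"):
    try:
        results[encoding] = data.decode(encoding)
        print(encoding, "->", results[encoding])
    except UnicodeDecodeError as error:
        # Record the failure instead of letting the program stop.
        results[encoding] = None
        print(encoding, "failed:", error.reason)

# errors="replace" never raises, but the result is lossy:
# undecodable bytes become the replacement character U+FFFD.
lossy = data.decode("utf-8", errors="replace")
print("\ufffd" in lossy)  # True
```

Only the decode with the original encoding, UTF-16, returns the message intact.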

Conclusion

Unicode stands as a pivotal character encoding standard that encompasses the vast array of written languages worldwide, making it the dominant standard for the Internet. This systematic approach assigns a unique code to each Unicode character, allowing for seamless representation across various computer systems and applications.