Unicode

Unicode is a character encoding standard that can represent characters from many different languages around the world.
Each character is identified by a Unicode code point, an integer value that uniquely identifies that character. Code points can be turned into bytes using different encodings, such as UTF-8 or UTF-16.
Each encoding specifies, according to its own scheme, how a character's code point is represented as one or more bytes.

As an integer, a code point can take values from 0 to 10FFFF in hexadecimal notation, which is 0 to 1,114,111 in decimal.
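To make the idea of a code point as an integer concrete, here is a small Python sketch. It uses the built-in `ord()` and `chr()` functions; the constant `MAX_CODE_POINT` and the helper `is_valid_code_point` are names made up for this illustration.

```python
# ord() maps a character to its code point; chr() maps a code point back.
code_point = ord("A")
print(code_point)       # 65
print(hex(code_point))  # 0x41
print(chr(0x41))        # A

# Hypothetical helper for this example: the range stated above.
MAX_CODE_POINT = 0x10FFFF


def is_valid_code_point(value: int) -> bool:
    """Return True if value lies inside the Unicode code point range."""
    return 0 <= value <= MAX_CODE_POINT


print(is_valid_code_point(0x10FFFF))  # True
print(is_valid_code_point(0x110000))  # False
```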

When referring to a Unicode code point in writing, we write U+ followed by the hexadecimal representation of the code point. For instance, the uppercase character A is written as U+0041. This notation is only used when referring to code points in text, though. It is not how the character is stored.
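As a rough sketch of the U+ notation, the following Python snippet converts between a character and its written form. The function names `to_u_notation` and `from_u_notation` are invented for this example, and padding to at least four hexadecimal digits is the usual convention assumed here.

```python
def to_u_notation(ch: str) -> str:
    """Format one character's code point as U+ followed by at least 4 hex digits."""
    return f"U+{ord(ch):04X}"


def from_u_notation(text: str) -> str:
    """Parse a string such as 'U+0041' back into the character it names."""
    if not text.upper().startswith("U+"):
        raise ValueError(f"not in U+ notation: {text!r}")
    return chr(int(text[2:], 16))


print(to_u_notation("A"))         # U+0041
print(from_u_notation("U+0041"))  # A
```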

At the byte level, Unicode characters are encoded differently. The uppercase character A does not need 6 bytes (the 6 ASCII characters U+0041) when encoded as bytes. The exact number of bytes depends on the encoding used, for example UTF-8 or UTF-16.
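A quick way to see this is to encode the character A with a few encodings and count the bytes. The sketch below uses Python's `str.encode()`; the helper `encoded_size` is a name chosen for this example, and the big-endian codec variants are used so that no byte order mark is added to the output.

```python
def encoded_size(text: str, encoding: str) -> int:
    """Return how many bytes `text` occupies in the given encoding."""
    return len(text.encode(encoding))


# The "-be" variants avoid a byte order mark being prepended.
for encoding in ("utf-8", "utf-16-be", "utf-32-be"):
    print(encoding, encoded_size("A", encoding), "A".encode(encoding).hex())

# utf-8 1 41
# utf-16-be 2 0041
# utf-32-be 4 00000041
```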

UTF-8 is a very commonly used text encoding on the web. Web browsers understand UTF-8, and many textual data formats and markup languages are commonly encoded in UTF-8, for instance JSON, XML, HTML, CSS and SVG.

UTF-8 uses 1, 2, 3 or 4 bytes to represent a Unicode character (that is, a Unicode code point).
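The four possible lengths can be observed with one sample character for each. This is only an illustrative Python sketch; `utf8_length` and the chosen sample characters are assumptions of the example, not part of any standard API.

```python
def utf8_length(ch: str) -> int:
    """Return the number of bytes UTF-8 uses for a single character."""
    return len(ch.encode("utf-8"))


# Sample characters: A, e with acute accent, euro sign, grinning face emoji.
samples = ["A", "\u00e9", "\u20ac", "\U0001F600"]
for ch in samples:
    print(f"U+{ord(ch):04X}", utf8_length(ch))

# U+0041 1
# U+00E9 2
# U+20AC 3
# U+1F600 4
```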

As already stated, UTF-8 uses 1, 2, 3 or 4 bytes to represent a Unicode character, so when reading UTF-8 encoded bytes into characters we first need a way to determine how many bytes were used to encode each character.
We can do so by looking at the bit pattern of the first byte.

If the first byte has the bit pattern 0ZZZZZZZ (most significant bit is 0) then the character code point is represented only by this byte.

If the first byte has the bit pattern 110YYYYY (3 most significant bits are 110) then the character code point is represented by two bytes.

If the first byte has the bit pattern 1110XXXX (4 most significant bits are 1110) then the character code point is represented by three bytes.

If the first byte has the bit pattern 11110VVV (5 most significant bits are 11110) then the character code point is represented by four bytes.
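The four rules above can be sketched as a small Python function that masks off the leading bits of the first byte and compares them with each pattern. The name `utf8_sequence_length` is made up for this example, and raising `ValueError` for any other leading byte is a design choice of the sketch.

```python
def utf8_sequence_length(first_byte: int) -> int:
    """Return how many bytes a UTF-8 sequence has, judged by its first byte."""
    if (first_byte & 0b1000_0000) == 0b0000_0000:  # 0ZZZZZZZ
        return 1
    if (first_byte & 0b1110_0000) == 0b1100_0000:  # 110YYYYY
        return 2
    if (first_byte & 0b1111_0000) == 0b1110_0000:  # 1110XXXX
        return 3
    if (first_byte & 0b1111_1000) == 0b1111_0000:  # 11110VVV
        return 4
    # Anything else cannot start a UTF-8 sequence.
    raise ValueError(f"not a valid first byte: 0x{first_byte:02X}")


print(utf8_sequence_length(0x41))  # 1
print(utf8_sequence_length(0xE2))  # 3
```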

Once you know how many bytes are used to represent the given character's code point, read all the actual code-point-carrying bits (the bits marked with V, W, X, Y and Z) into a single 32-bit data type (e.g. a Java int). Each byte after the first starts with the bits 10 and carries six more code point bits. The bits then make up the integer value of the code point.
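Putting the steps together, a minimal decoder might look like the Python sketch below. The function name `decode_utf8_code_point` and its return shape are assumptions of this example. It relies on the standard UTF-8 layout in which every byte after the first starts with the bits 10 and carries six payload bits, and it deliberately skips stricter checks (such as rejecting overlong forms or truncated input) that a production decoder would need.

```python
from typing import Tuple


def decode_utf8_code_point(data: bytes, offset: int = 0) -> Tuple[int, int]:
    """Decode one code point starting at `offset`.

    Returns (code_point, number_of_bytes_consumed).
    """
    first = data[offset]
    # Determine the length and keep only the payload bits of the first byte.
    if (first & 0x80) == 0x00:    # 0ZZZZZZZ
        length, code_point = 1, first & 0x7F
    elif (first & 0xE0) == 0xC0:  # 110YYYYY
        length, code_point = 2, first & 0x1F
    elif (first & 0xF0) == 0xE0:  # 1110XXXX
        length, code_point = 3, first & 0x0F
    elif (first & 0xF8) == 0xF0:  # 11110VVV
        length, code_point = 4, first & 0x07
    else:
        raise ValueError(f"not a valid first byte: 0x{first:02X}")

    # Each following byte contributes its low six bits.
    for i in range(1, length):
        byte = data[offset + i]
        if (byte & 0xC0) != 0x80:
            raise ValueError(f"expected continuation byte, got 0x{byte:02X}")
        code_point = (code_point << 6) | (byte & 0x3F)

    return code_point, length


encoded = "\u20ac".encode("utf-8")  # euro sign, bytes E2 82 AC
code_point, consumed = decode_utf8_code_point(encoded)
print(f"U+{code_point:04X}", consumed)  # U+20AC 3
```

In Python the result is an arbitrary-precision integer, so the 32-bit container mentioned above is not needed here, but the bit operations are the same ones you would use with a Java int.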
