Unicode and Character Encoding

In programming, characters are the fundamental units of text. In C, a character is stored as a small integer, and a character encoding defines which integer (or sequence of bytes) stands for which character. Understanding Unicode and character encoding is crucial for handling text correctly in C programs.

Basics of Character Encoding

In early computing, character sets like ASCII (American Standard Code for Information Interchange) were used to represent characters. ASCII encoded characters using 7 bits, accommodating 128 characters including English letters, digits, and symbols.

Using ASCII in C

				
					#include <stdio.h>

int main() {
    char ch = 'A';
    printf("ASCII value of %c is %d\n", ch, ch);
    return 0;
}

				
			
				
					// output //
ASCII value of A is 65

				
			

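Explanation: In this code, ‘A’ is stored as the number 65, its ASCII code. The %c conversion prints the value as a character, while %d prints the same value as a decimal integer.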

Limitations of ASCII

ASCII has room for only 128 characters, so it cannot represent accented letters, non-Latin scripts such as Devanagari or Cyrillic, or most special symbols. Unicode was introduced to address these limitations by providing a universal character set.

Introduction to Unicode

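Unicode is a character encoding standard that aims to represent every character in every language. It assigns a unique number, called a code point, to each character and symbol from the world's writing systems, including alphabets, ideograms, and emojis. Code points are written in hexadecimal with a U+ prefix (for example, U+0041 for ‘A’) and range from U+0000 to U+10FFFF.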

Unicode Encoding Schemes

Unicode can be encoded using different schemes such as UTF-8, UTF-16, and UTF-32. Each scheme represents Unicode characters using different byte sequences.

UTF-8 Encoding

UTF-8 is a variable-length encoding scheme for Unicode characters. It uses 8-bit code units, and each character takes one to four of them. ASCII characters are represented using a single byte with the same value as in ASCII, while all other characters are represented using two to four bytes.

Using UTF-8 in C

				
					#include <stdio.h>

int main() {
    char utf8[] = u8"नमस्ते"; // UTF-8 encoded "Namaste" in Hindi
    printf("UTF-8 string: %s\n", utf8);
    return 0;
}

				
			
				
					// output //
UTF-8 string: नमस्ते

				
			

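Explanation: In this code, the u8 prefix guarantees that the string literal “नमस्ते” (“Namaste” in Hindi) is stored as UTF-8 bytes. printf writes those bytes unchanged, so the text displays correctly only on a terminal that interprets its input as UTF-8.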

UTF-16 and UTF-32 Encoding

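UTF-16 uses 16-bit code units and, like UTF-8, is variable-length: characters in the Basic Multilingual Plane (U+0000 to U+FFFF) take one code unit, while all others, including most emojis, take two code units called a surrogate pair. UTF-32 uses 32-bit code units and is the only fixed-width scheme of the three: every Unicode character occupies exactly one code unit.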

Handling Unicode Characters in C

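C offers several ways to handle Unicode text. The oldest is the wide character type wchar_t, together with the ‘w’-prefixed functions in <wchar.h> such as wprintf and wcslen. The width of wchar_t is implementation-defined: it is typically 16 bits on Windows (holding UTF-16 code units) and 32 bits on Linux and macOS (holding UTF-32). C11 added char16_t and char32_t in <uchar.h> for UTF-16 and UTF-32 code units.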

Using Wide Characters in C

				
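					#include <locale.h>
#include <stdio.h>
#include <wchar.h>

int main() {
    setlocale(LC_ALL, ""); // Use the environment's locale (assumed UTF-8) for output conversion
    wchar_t wc[] = L"😊"; // Wide string holding a smiley emoji (U+1F60A)
    wprintf(L"Wide character: %ls\n", wc);
    return 0;
}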

				
			
				
					// output //
Wide character: 😊

				
			

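Explanation: This code uses a wide string to print a smiley emoji. wprintf converts wide characters to bytes according to the current locale. A program starts in the "C" locale, in which the emoji typically cannot be converted, so a call to setlocale(LC_ALL, "") is needed to select the environment's locale first. The output shown assumes a UTF-8 locale.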

ASCII values

				
					+-----+-------------+-----+-------------+-----+-------------+-----+-------------+
| Dec |   Char      | Dec |   Char      | Dec |   Char      | Dec |   Char      |
+-----+-------------+-----+-------------+-----+-------------+-----+-------------+
|  0  |   NUL       |  32 |   SPACE     |  64 |   @         |  96 |   `         |
|  1  |   SOH       |  33 |   !         |  65 |   A         |  97 |   a         |
|  2  |   STX       |  34 |   "         |  66 |   B         |  98 |   b         |
|  3  |   ETX       |  35 |   #         |  67 |   C         |  99 |   c         |
|  4  |   EOT       |  36 |   $         |  68 |   D         | 100 |   d         |
|  5  |   ENQ       |  37 |   %         |  69 |   E         | 101 |   e         |
|  6  |   ACK       |  38 |   &         |  70 |   F         | 102 |   f         |
|  7  |   BEL       |  39 |   '         |  71 |   G         | 103 |   g         |
|  8  |   BS        |  40 |   (         |  72 |   H         | 104 |   h         |
|  9  |   TAB       |  41 |   )         |  73 |   I         | 105 |   i         |
| 10  |   LF        |  42 |   *         |  74 |   J         | 106 |   j         |
| 11  |   VT        |  43 |   +         |  75 |   K         | 107 |   k         |
| 12  |   FF        |  44 |   ,         |  76 |   L         | 108 |   l         |
| 13  |   CR        |  45 |   -         |  77 |   M         | 109 |   m         |
| 14  |   SO        |  46 |   .         |  78 |   N         | 110 |   n         |
| 15  |   SI        |  47 |   /         |  79 |   O         | 111 |   o         |
| 16  |   DLE       |  48 |   0         |  80 |   P         | 112 |   p         |
| 17  |   DC1       |  49 |   1         |  81 |   Q         | 113 |   q         |
| 18  |   DC2       |  50 |   2         |  82 |   R         | 114 |   r         |
| 19  |   DC3       |  51 |   3         |  83 |   S         | 115 |   s         |
| 20  |   DC4       |  52 |   4         |  84 |   T         | 116 |   t         |
| 21  |   NAK       |  53 |   5         |  85 |   U         | 117 |   u         |
| 22  |   SYN       |  54 |   6         |  86 |   V         | 118 |   v         |
| 23  |   ETB       |  55 |   7         |  87 |   W         | 119 |   w         |
| 24  |   CAN       |  56 |   8         |  88 |   X         | 120 |   x         |
| 25  |   EM        |  57 |   9         |  89 |   Y         | 121 |   y         |
| 26  |   SUB       |  58 |   :         |  90 |   Z         | 122 |   z         |
| 27  |   ESC       |  59 |   ;         |  91 |   [         | 123 |   {         |
| 28  |   FS        |  60 |   <         |  92 |   \         | 124 |   |         |
| 29  |   GS        |  61 |   =         |  93 |   ]         | 125 |   }         |
| 30  |   RS        |  62 |   >         |  94 |   ^         | 126 |   ~         |
| 31  |   US        |  63 |   ?         |  95 |   _         | 127 |   DEL       |
+-----+-------------+-----+-------------+-----+-------------+-----+-------------+

				
			

Character encodings

				
					+--------------+-----------------------------+-------------------------+
|   Encoding   |       Description           |         Example         |
+--------------+-----------------------------+-------------------------+
|     ASCII    |   American Standard Code    |            A            |
|              |  for Information Interchange|                         |
+--------------+-----------------------------+-------------------------+
|     UTF-8    |   Unicode Transformation    |         नमस्ते         |
|              |    Format 8-bit             |                         |
+--------------+-----------------------------+-------------------------+
|    UTF-16    |   Unicode Transformation    |       U+1F60A (😊)      |
|              |    Format 16-bit            |                         |
+--------------+-----------------------------+-------------------------+
|    UTF-32    |   Unicode Transformation    |    U+1F60A (0001F60A)   |
|              |    Format 32-bit            |                         |
+--------------+-----------------------------+-------------------------+
|     ISO      |  International Organization |         é (Latin-1)     |
|   8859-1     |  for Standardization        |                         |
+--------------+-----------------------------+-------------------------+
|     Windows  |   Microsoft Windows         |            é            |
|   1252/ANSI  |   Code Page 1252/ANSI       |                         |
+--------------+-----------------------------+-------------------------+
|     EBCDIC   |   Extended Binary Coded     |            A            |
|              |   Decimal Interchange Code  |                         |
+--------------+-----------------------------+-------------------------+
|     KOI8-R   |   Russian                   |          Привет         |
|              |   (RFC 1489)                |                         |
+--------------+-----------------------------+-------------------------+
|    Shift-JIS |   Japanese                  |         こんにちは      |
|              |                             |                         |
+--------------+-----------------------------+-------------------------+
|     GB2312   |   Simplified Chinese        |         你好            |
|              |   (encoded as EUC-CN)       |                         |
+--------------+-----------------------------+-------------------------+
|    Big5      |   Traditional Chinese       |         你好            |
|              |   (variant: Big5-HKSCS)     |                         |
+--------------+-----------------------------+-------------------------+

				
			

Character encodings example

				
					#include <locale.h>
#include <stdio.h>
#include <uchar.h>
#include <wchar.h>

// Print the bytes of a string in hexadecimal. A terminal displays only one
// encoding at a time, so the legacy-encoded strings are shown as byte values.
static void print_bytes(const char *label, const char *s) {
    printf("%s bytes:", label);
    for (; *s != '\0'; s++) {
        printf(" %02X", (unsigned char)*s);
    }
    printf("\n");
}

int main() {
    // Select the environment's locale (assumed UTF-8) so %ls can convert wide strings
    setlocale(LC_ALL, "");

    // ASCII
    char ascii[] = "Hello, World!"; // ASCII string
    printf("ASCII string: %s\n", ascii);

    // UTF-8
    char utf8[] = u8"नमस्ते"; // UTF-8 encoded "Namaste" in Hindi
    printf("UTF-8 string: %s\n", utf8);

    // Wide string: wchar_t typically holds UTF-16 on Windows, UTF-32 on Linux and macOS
    wchar_t wide[] = L"\u03A9\u03B2\u03B3\u03B4"; // Greek letters
    printf("Wide string: %ls\n", wide); // printf with %ls; printf and wprintf must not share a stream

    // UTF-32
    char32_t utf32[] = U"\U0001F60A"; // UTF-32 encoded smiley emoji
    printf("UTF-32 code unit: 0x%08lX\n", (unsigned long)utf32[0]); // %s cannot print char32_t text

    // ISO 8859-1 (Latin-1)
    print_bytes("ISO 8859-1", "\xE9"); // "é"

    // Windows 1252/ANSI
    print_bytes("Windows 1252/ANSI", "\xE9"); // "é"

    // EBCDIC
    print_bytes("EBCDIC", "\xC1"); // "A"

    // KOI8-R
    print_bytes("KOI8-R", "\xF0\xD2\xC9\xD7\xC5\xD4"); // "Привет"

    // Shift-JIS
    print_bytes("Shift-JIS", "\x82\xB1\x82\xF1\x82\xC9\x82\xBF\x82\xCD"); // "こんにちは"

    // GB2312
    print_bytes("GB2312", "\xC4\xE3\xBA\xC3"); // "你好"

    // Big5
    print_bytes("Big5", "\xA7\x41\xA6\x6E"); // "你好"

    return 0;
}

				
			
				
					// output //
ASCII string: Hello, World!
UTF-8 string: नमस्ते
Wide string: Ωβγδ
UTF-32 code unit: 0x0001F60A
ISO 8859-1 bytes: E9
Windows 1252/ANSI bytes: E9
EBCDIC bytes: C1
KOI8-R bytes: F0 D2 C9 D7 C5 D4
Shift-JIS bytes: 82 B1 82 F1 82 C9 82 BF 82 CD
GB2312 bytes: C4 E3 BA C3
Big5 bytes: A7 41 A6 6E

				
			

Understanding Unicode and character encoding is essential for writing C programs that handle text correctly. With UTF-8, UTF-16, and UTF-32, a C program can represent characters from any language and writing system, which is the basis for internationalization and localization. Text displays correctly across platforms only when the program, its input, and the output device agree on the encoding.

Happy coding!❤️
