Electronic communication has changed how the world is connected. It started long before Twitter, or television, or even the telephone, and the secret lies in keying and digital coding.

The earliest widely used method of electronic communication was the electric telegraph. We may not think of this as “digital,” but it used on/off keying to transmit messages. Although more complex signaling methods were attempted, the most common telegraph was a simple on/off circuit switched by a telegraph key, operating a sounder at the receiving end. Morse code used this on/off keying to send characters and numbers (Figure 1).


Figure 1 Simple on/off keying is often used to send text messages, as in this Morse code “R.”
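
To make the keying concrete, here is a minimal sketch of how characters map to key-down and key-up intervals. The code table is deliberately partial and illustrative; the timing convention of one unit per dit, three per dah, and three units between characters is standard Morse practice.

```python
# A minimal sketch of Morse on/off keying: each character maps to a
# pattern of short ("dit") and long ("dah") key-down intervals.
MORSE = {"R": ".-.", "E": ".", "T": "-"}  # partial table for illustration

def keying_pattern(text, unit=1):
    """Return (on_time, off_time) pairs in timing units: dit=1, dah=3."""
    events = []
    for ch in text.upper():
        for i, symbol in enumerate(MORSE[ch]):
            on = unit if symbol == "." else 3 * unit
            # 1 unit of silence between symbols, 3 units between characters
            off = unit if i < len(MORSE[ch]) - 1 else 3 * unit
            events.append((on, off))
    return events

print(keying_pattern("R"))  # -> [(1, 1), (3, 1), (1, 3)]
```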

Baudot

As radioteletype developed, the Baudot code (also called the Murray code) became the dominant coding format. Imagine that you want to encode 26 English letters, the numbers 0 through 9, and a dozen or so punctuation, math, and control codes. That requires about 50 unique codes to be sent. Baudot uses just 5 bits, which can represent 2⁵ = 32 unique codes. To support additional codes, the Baudot code makes use of a state change, controlled by the letter shift (LTRS) code and the figure shift (FIGS) code, which allows it to support roughly twice as many characters. The FIGS code (11011) signals that subsequent characters are to be interpreted as being in the FIGS set, until this is reset by the LTRS (11111) code. This is a workable technique, but it does require that the receiving unit keep track of the state of the system. Even so, the Baudot code is limited in the characters it can represent.
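
The shift-state mechanism is easy to see in code. Below is a minimal decoding sketch: the 5-bit LTRS and FIGS values are the ones given above, while the partial lookup tables use common ITA2 assignments and are illustrative only (a real implementation would fill in all 32 entries for each shift state).

```python
LTRS = 0b11111  # letter-shift control code
FIGS = 0b11011  # figure-shift control code

# Partial lookup tables keyed by 5-bit code value (illustrative ITA2 entries).
LETTERS = {0b00011: "A", 0b11001: "B", 0b01110: "C", 0b00001: "E"}
FIGURES = {0b00011: "-", 0b11001: "?", 0b01110: ":", 0b00001: "3"}

def decode_baudot(codes):
    """Decode a sequence of 5-bit codes, tracking the LTRS/FIGS shift state."""
    table = LETTERS           # receivers start in letter shift
    out = []
    for code in codes:
        if code == LTRS:
            table = LETTERS   # state change only; emits no character
        elif code == FIGS:
            table = FIGURES
        else:
            out.append(table.get(code, "?"))  # '?' for codes not in this sketch
    return "".join(out)

# The same 5-bit code (00001) means 'E' in letter shift but '3' in figure shift.
print(decode_baudot([0b00001, FIGS, 0b00001, LTRS, 0b00011]))  # -> "E3A"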

ASCII

In 1968, the American National Standards Institute adopted the American National Standard Code for Information Interchange (ASCII), commonly pronounced “ask-key.” ASCII started out as a 7-bit code, including the 26 letters (upper and lower case) used for English, the numbers 0 through 9, and many other characters and control codes (Figure 2). This 26-letter alphabet used for English is more correctly called the Latin alphabet.

Note that ASCII found widespread use for both transmission and storage of information. Over time, an uncountable number of text files have been created and stored away encoded as ASCII.


Figure 2 The basic 7-bit ASCII code supports the Latin alphabet (upper and lower case), numbers, punctuation and other printable characters, and control codes. (Source: The Languages of Launchpad, Part 2)

The ASCII code was designed with an eye toward digital manipulation of the characters. For example, the number characters (0, 1, 2, etc.) are in order starting at 30 hexadecimal (hex). That is, the character 0 is 30 hex, the character 1 is 31 hex, etc. Stripping off the upper bits provides for easy conversion from the ASCII code to the numeric value.
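
As a quick illustration of this masking trick (assuming the input is a valid digit character):

```python
# ASCII digits run from 0x30 ('0') to 0x39 ('9'), so masking off the
# upper bits (or subtracting 0x30) yields the numeric value directly.
def digit_value(ch):
    assert "0" <= ch <= "9", "expects a single ASCII digit"
    return ord(ch) & 0x0F   # e.g., '7' is 0x37, and 0x37 & 0x0F == 7

print(digit_value("7"))  # -> 7
```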

Similarly, uppercase and lowercase letters are numbered for easy conversion. Capital letters start at 41 hex (letter A), followed by 42 hex (B), and so forth. Lowercase letters start at 61 hex (a), then 62 hex (b), etc. Converting from uppercase to lowercase means simply setting a 1 in the proper bit position (bit 5, or 20 hex).
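
A short sketch of the bit trick: setting bit 5 (20 hex) lowercases an ASCII letter, and clearing it uppercases one.

```python
# Bit 5 (0x20) is the only difference between 'A' (0x41) and 'a' (0x61),
# so case conversion is a single bitwise operation on an ASCII letter.
def to_lower(ch):
    return chr(ord(ch) | 0x20)   # set bit 5 -> lowercase

def to_upper(ch):
    return chr(ord(ch) & ~0x20)  # clear bit 5 -> uppercase

print(to_lower("G"), to_upper("g"))  # -> g G
```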

“Basic ASCII” uses the values 0 to 127 (0 to 7F hex), requiring just 7 bits. An extended character set (using an eighth bit) was later added to the standard; in fact, several incompatible extended character sets have emerged over time. These characters are coded as 80 to FF hex and include special characters found in non-English languages, additional math symbols, and basic graphics symbols.

Unicode and UTF-8

While ASCII has served us well for languages that use the Latin character set, it clearly falls short for other languages. To address this issue, the Unicode standard was developed, currently with 137,994 characters defined (Unicode 12.1). Think of Unicode as a defined set of characters, each mapped to a unique number, called a code point, that fits within 32 bits.
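
A few lines of Python make the mapping concrete, printing each character's code point in the standard U+XXXX notation:

```python
# Unicode maps every character to a unique number (its code point).
for ch in ["A", "Ω", "中", "♞"]:
    print(f"{ch} -> U+{ord(ch):04X} ({ord(ch)})")
# A -> U+0041 (65), Ω -> U+03A9 (937), 中 -> U+4E2D (20013), ♞ -> U+265E (9822)
```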

The set of Unicode characters is very extensive, including characters for many diverse languages, mathematical operators, geometric shapes, chess symbols, mahjong tiles, and emoticons. We won’t try to list them all here, but you can browse the character set using the URL in Reference 3. The ASCII character set remains intact in the Unicode standard, referred to as the “Basic Latin” script.

UTF-8 is a standard that defines how Unicode code points are encoded as bytes in actual applications. It has become dominant on the World Wide Web, used by over 94% of websites.

It uses a variable-length encoding so that ASCII characters still consume only one byte, while other characters take two, three, or four bytes as needed to cover the huge number of remaining code points. Without this backward compatibility, existing ASCII-only files would expand by a factor of four under a fixed 32-bit encoding.
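
This variable-length behavior is easy to demonstrate. The snippet below (Python 3.8+ for bytes.hex with a separator) shows the encoded byte count growing from one to four as the code point value increases:

```python
# UTF-8 byte lengths for characters of increasing code point value:
# ASCII stays at one byte; higher code points take two to four bytes.
for ch in ["A", "é", "中", "😀"]:
    encoded = ch.encode("utf-8")
    print(f"U+{ord(ch):04X} {ch!r}: {len(encoded)} byte(s) -> {encoded.hex(' ')}")
```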

Tom Scott has an interesting video explanation of the “miracle of Unicode” that is worth a look (Reference 5 and below). Unicode will probably continue to be revised and extended, but hopefully it will serve us for many years to come. The 32-bit format is a huge improvement over Morse code, but at its core it’s still just on/off keying.

Bob Witte is President of Signal Blue LLC, a technology consulting company.