WhatIsUtf8AllAbout (30 Nov 2015, WillNorris)

What is `UTF-8` all about?

To understand what's going on with the UTF-8 migration, you need to look back at the beginnings of electronic communications, even before computers.

ASCII

Early computers communicated through telegraph hardware such as the Teletype. These devices used a 7-bit code and supported ASCII, the American Standard Code for Information Interchange. 7-bit ASCII supports a total of 128 codes: 33 control codes, originally used to drive the old Teletype hardware, and 95 printable characters, mostly A-Z, a-z, 0-9, and a few other symbols. Anything outside of those 95 characters is "extended" and often regional in nature. Clearly, 95 characters are not enough to support languages other than English.
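The 128/33/95 split can be checked with a few lines of Python. This is only a sketch: it assumes the usual convention that the control codes are 0-31 plus 127 (DEL) and that the space character counts as printable.

```python
# 7 bits give 2**7 = 128 possible code values.
codes = range(128)

# Assumed convention: control codes are 0-31 plus 127 (DEL);
# the space character (32) is counted as printable.
control = [c for c in codes if c < 32 or c == 127]
printable = [c for c in codes if 32 <= c < 127]

print(len(codes), len(control), len(printable))
print("".join(chr(c) for c in printable))
```

The second line of output lists every printable ASCII character, which makes it easy to see that no accented letters are available.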

Most computers store each character using 8 bits, which can represent 256 unique values. To step beyond the limits of 7-bit ASCII, the industry began to use the extra 128 values for "Extended ASCII". Because 256 characters are still not enough, and needs differ from region to region, many different extensions were defined, and the particular extension in use had to be specified. The extensions were called "Code pages".

Code Pages

Code pages (Foswiki calls them Character Sets) are often specialized or regional in nature. They re-assign the 128 "high" characters that ASCII leaves undefined. Foswiki 1.1 allowed the administrator to set one system-wide character set in {Site}{CharSet}. Code pages exist for the Greek alphabet, Eastern European languages, mathematics, and graphical symbols (smilies, line drawing, etc.). There are many such pages, but several have either been standardized or are in very common use.

windows-1252: Also called cp-1252, this code page was standardized by Microsoft for use in Windows. It extends ASCII with commonly used characters supporting European languages, and some non-standard characters used extensively by Microsoft Word: "Smart quotes" (separate open and close single and double quotes) and other publishing-related characters. If you have users who copy/paste text from MS Word, then you are dealing with topics containing these code points.
ISO-8859-xx: These are the internationally standardized extensions to ASCII. There are 15 published parts, numbered 1 to 16 (part 12 was abandoned). The most common are ISO-8859-1 for Western Europe, and ISO-8859-15, which updates -1 with the Euro symbol and several more commonly used letters.
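To see how code pages re-assign the "high" byte values, the sketch below decodes the same few bytes under several code pages. The codec names are Python's names for these code pages, `decode_byte` is a helper made up for this illustration, and ISO-8859-7 is included only as an example of a Greek code page.

```python
def decode_byte(value, codec):
    """Interpret a single byte value under the named code page."""
    # errors="replace" keeps the demo running if a code page leaves
    # the byte value undefined.
    return bytes([value]).decode(codec, errors="replace")


# Python codec names (an assumption of this sketch) for the code pages
# discussed above, plus ISO-8859-7 as a Greek example.
CODE_PAGES = ("cp1252", "iso8859_1", "iso8859_15", "iso8859_7")

for value in (0x80, 0x93, 0xA4, 0xE1):
    row = []
    for codec in CODE_PAGES:
        ch = decode_byte(value, codec)
        # Show the code point as well, because some results are
        # invisible control characters.
        shown = ch if ch.isprintable() else "(control)"
        row.append(f"{codec}: U+{ord(ch):04X} {shown}")
    print(f"byte 0x{value:02X} -> " + " | ".join(row))
```

The byte is identical in every column; only the code page used to read it changes the character that comes out.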

The problem with all this is that, as they say, "Beauty is in the eye of the beholder": it is up to the reader of the information to interpret the characters. Foswiki 1.x provides a {Site}{CharSet} configuration parameter, but users cut and paste text, and may have configured their browsers to override the code page. It is difficult to tell from looking at a block of text which character set it really uses, and Foswiki did nothing to enforce one. What you see is what you get.
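A short Python sketch of why the stored bytes alone cannot settle the question. It assumes, purely for illustration, that a user pasted a word with Word-style smart quotes from a windows-1252 browser into a site that expects a different character set; the sample text and variable names are invented.

```python
# Hypothetical pasted text: "naive" with a diaeresis, wrapped in smart quotes.
original = "\u201cna\u00efve\u201d"

# The bytes that end up stored if the browser submits windows-1252.
stored = original.encode("cp1252")
print(stored)

# Every single-byte code page accepts these bytes without complaint,
# so a wrong guess produces wrong characters rather than an error.
for codec in ("cp1252", "iso8859_1", "iso8859_15"):
    decoded = stored.decode(codec)
    print(codec, repr(decoded), "matches" if decoded == original else "differs")
```

Only the cp1252 reading recovers the smart quotes. The ISO-8859 readings silently turn them into invisible control characters, and nothing in the bytes themselves says which reading is right.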

Code pages have left us with something of a proverbial "Tower of Babel".

UNICODE or UCS (Universal Character Set)

Enter UNICODE. The one character set to rule them all. UNICODE and UCS have the lofty goal of representing every possible character known to humanity, and even some aliens (a Klingon alphabet has been mapped, unofficially, into UNICODE's Private Use Area). UNICODE defines well over 100,000 characters, so a single 8-bit byte cannot identify them all. And this brings us to UTF-8, a multi-byte encoding of UNICODE. Each character is stored as 1 to 4 bytes (8, 16, 24, or 32 bits). As a user, you see characters. Consider the 3 strings below. Each is 3 characters long, a variation on ABC.

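3 characters	Code points	UTF-8 bytes (hex)	Length in the file (bytes)
"ABC"	U+0041 U+0042 U+0043	41 42 43	3
"ĀƁƇ"	U+0100 U+0181 U+0187	C4 80 C6 81 C6 87	6
"𐌀𐌁𐌂"	U+10300 U+10301 U+10302	F0 90 8C 80 F0 90 8C 81 F0 90 8C 82	12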

This example could not have been created on Foswiki 1.1, and that begins to show the flexibility of running the wiki using UNICODE and UTF-8.

Most languages can now be fully and accurately represented. There is no need to compromise on characters and introduce intentional misspellings due to missing characters.
Asian languages can be expressed: 子,孔,字,學
Engineering and Math texts can be represented: ∲ ∴ ⊅ (α + β)/γ
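The three-string table earlier in this section can be reproduced with Python's built-in UTF-8 codec. This is a sketch, not part of Foswiki: `utf8_length` is a helper written for this illustration, and the escapes assume the table's characters are U+0100, U+0181, U+0187 and U+10300 to U+10302.

```python
def utf8_length(code_point):
    """Number of bytes UTF-8 uses to store one code point."""
    if code_point < 0x80:
        return 1  # the ASCII range is stored unchanged
    if code_point < 0x800:
        return 2
    if code_point < 0x10000:
        return 3
    return 4  # everything up to U+10FFFF


# Assumed to be the three strings from the table above.
samples = [
    "ABC",
    "\u0100\u0181\u0187",
    "\U00010300\U00010301\U00010302",
]

for text in samples:
    encoded = text.encode("utf-8")
    points = " ".join(f"U+{ord(ch):04X}" for ch in text)
    # The range rule above should agree with the real encoder.
    predicted = sum(utf8_length(ord(ch)) for ch in text)
    assert predicted == len(encoded)
    print(f"{len(text)} chars | {points} | {encoded.hex(' ')} | {len(encoded)} bytes")
```

Each string is 3 characters long to the user, while the number of bytes in the file depends on how far each character's code point lies from the ASCII range.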

For more technical detail on UNICODE, see UnderstandingEncodings