Monday, January 22, 2007

Facts about Unicode which you might not know

  1. There is no such thing called "Plain Text". All text and strings are byte data stored in memory and disk. So a better word for previous "Ascii string" might be "binary string" or "byte string".
  2. The operation system needs a way to transform meaningless in memory binary data into meaningful string to display on monitor, print on paper. In the old time before Unicode, this semi-standard byte-to-string mapping table is called "Code Page".
  3. A byte can contain 256 characters with the low 128 characters common agreed as "ASCII" character set, but for the rest 128 characters, different country/culture choose a different way to use them. Thus a 0xc5 should be different in Hebrew and Russian language.
  4. IBM PC has defined one way to use 128-255 byte space, which is common called "latin-1" character set, or "iso-8259-1". This character set is so standard that many modern softwares, like MySQL, still makes it the default encoding for database connection, which causes tons of headaches. Someone even calls it "Character Set Hell".
  5. Chinese and other east asia characters requires at least two bytes to represent a word. Thus, the need for a multi-byte string. In Window, this is where the notorious MBCS comes into play. Plus the tons of methods in ATL/MFC to convert from one type to another.
  6. A more precise way for "UTF-8" is "an unicode string encoded using UTF-8 encoding rules". So first of all, UTF-8 can be used in transmission from one machine to another, but inside a computer, it still needs to be converted into bytes. UTF-8 is not Unicode.
  7. Unicode is a standard way to define how to represent any given character in memory. It has nothing related to how the character rendered on the monitor or on paper. It's only a world-level unique way to ensure that given any character, no matter it's a Thai, Chinese, Russian, Hebrew or French, there is only one single way to represent it in binary format.
  8. Unicode, as an internal representation, looks like like U+0048 U+0065 U+006C. When it's encoded into ASCII encoding and saved, it'll be the as 48 65 6C. The ASCII string could, understandably, be decoded into Unicode representation.
  9. The decode process requires an explicit definition of which encoding the input string is "encoded" in. Otherwise, the system level default encoding will be used.
  10. Major modern database software, Oracle, MSSQL, MySQL, PostgreSQL, SAP DB, Sqlite, all supports storing unicode string as content.
  11. UTF-8 is a preferable way to transmit and store data in because it has two unique characteristics: 1) UTF-8 is fully compatible with ASCII character set since it either uses one byte for 0-128 characters, or three bytes for the rest. Since most network applications, especially HTTP protocol accepts only ASCII character set, UTF-8 is the only encoding choice that's compatible with these network protocols. 2) UTF-8 contains the same character set as defined in Unicode. The process of encoding Unicode into ASCII is downcasting because it shrinks the amount of characters that could be represented.
  12. In order to make your web pages display correctly, it's recommended to include a META tag in your HTML code: meta equiv="Content_Type" charset="UTF-8". If you're using Apache, you could add AddDefaultCharset utf-8 into your httpd.conf to let Apache server add this to all the web pages your website is servicing.
  13. If you need to make your web forms have multi-lingual support, like Google's search textbox, you'll need to
References
I've done a quite extensive reading on this topic. Basically I found all the articles on this topic and continue reading them until I could no longer find any concepts that's new. Here are a few links I'd recommend:

No comments: