Previous Page
Next Page

4.8. Using UTF-8 with MySQL

As with PHP, as long as your medium supports raw bytes streams, then it supports UTF-8. MySQL does indeed support byte streams, so storing and retrieving UTF-8-encoded strings works in just the same way as storing plain ASCII text.

If we can read and write data, what else is left? As with PHP, there are a couple of important issues. Sorting, something you often want to do in the database layer rather than the code layers, will also need to work with our Unicode data. Luckily for us, as we already discussed, UTF-8 can be binary sorted and comes out in code point order. This means that the regular MySQL sort works fine with your UTF-8 data, as long as you define your columns with the BINARY attribute (for CHAR and VARCHAR columns) and use BLOB instead of TEXT types.

As with PHP, the thing we need to worry about is string manipulation. You can usually avoid most string manipulation by moving logic from your SQL into your code layers. Avoid using SQL statements of this type:

SELECT SUBSTRING(name, 0, 1) FROM UserNames;

Instead, move the same logic into your business logic layer:

<?php
 $rows = db_fetch_all("SELECT name FROM UserNames;");
foreach($rows as $k => $v){
          $rows[$k]['name'] = mb_substr($v['name'], 0, 1);
  }
>

In some cases this is going to be a problemif you were using a substring operation in your SQL to select or join against, then you'll no longer be able to perform that operation. The alternative is either to have character set support inside your database (which we'll talk about in a moment) or to lay out your data in such a way that you simplify the query. For instance, if you were performing a substring operation to group records by the first character in a certain field, you could store the first character (as a set of normalized code points) in a separate field and use that field directly, avoiding any string operations inside the database.

MySQL also has another set of string manipulation functions that it uses in the background, which you can easily miss. To create FULLTEXT indexes, MySQL needs to chop up the input string into different words to individually index. Without support for UTF-8, Unicode strings will be incorrectly sliced up for indexing, which can return some really bizarre and unexpected results.

Unlike the explicit string manipulation functions, there's no way you can move the text-indexing logic into your code layers without rewriting the text indexer from scratch. Since a text indexer is a fairly sophisticated piece of code, and somebody has already written for us in the shape of MySQL's FULLTEXT indexes, it would be a big waste of your time to implement yourself.

Luckily, MySQL version 4.1 saved us from doing any work; it comes with support for multiple character sets and collations, including UTF-8. When you create a table, you can specify per column character sets, or you can set a default for a server, database, or table to avoid having to be specific every time you create a new column. Data in this column is then stored in that format, regular string manipulation functions can be used, and FULLTEXT indexes work correctly.

It also has the nice benefit of changing column-length specifications from bytes to characters. Previous to version 4.1, a MySQL column type of CHAR(10) meant 10 bytes, so you could store between 2 and 10 UTF-8 characters. In version 4.1, CHAR(10) means 10 characters and so might take up 10 or more bytes. If you're concerned about space, you should avoid using the CHAR type (and instead use VARCHAR) as a CHAR(10) column actually needs 30 bytes to account for each of the 10 characters potentially having 3 bytes.

MySQL currently has a limitation of 3 bytes per characters for UTF-8, which means it can't store code points above U+FFFF. This probably isn't an issue for most people: this region contains musical symbols, Old Persian characters, Aegean numbers, and other such oddities. But it's worth bearing in mind that some code points can't be stored, and you might want to account for this in your data-filtering code.


Previous Page
Next Page