4.7. Using UTF-8 with Other Languages

The techniques we've talked about with PHP apply equally well to other languages that lack core Unicode support, including Perl versions before 5.6 and other legacy languages. As long as the language can transparently work with streams of bytes, we can pass around strings as opaque chunks of binary data. For any string manipulation or verification, we'll need to hand the work off to a dedicated library such as iconv or ICU to do the dirty work.

Many languages now come with full or partial Unicode support built in. Perl versions 5.8.0 and later can work transparently with Unicode strings, while version 5.6.0 has limited support using the use utf8 pragma. Perl 6 plans to have very extensive Unicode support, allowing you to manipulate strings at the byte, code point, and grapheme levels. PHP 6 plans to have Unicode support built right into the language, which should make porting existing code a fairly painless experience. Ruby 1.8 has no explicit Unicode support; like PHP, it treats strings as sequences of 8-bit bytes. Unicode support of some kind is planned for Ruby 1.9/2.0.
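To make the distinction between the byte, code point, and grapheme levels concrete, here is a minimal sketch in Java, chosen only because it has native Unicode strings. The class name StringLevels and its method names are made up for this illustration, and the grapheme count relies on the JDK's BreakIterator, whose segmentation of more exotic sequences (such as some emoji) can vary between Java versions.

```java
import java.nio.charset.StandardCharsets;
import java.text.BreakIterator;

// Hypothetical helper: measures one string at three different levels.
final class StringLevels {

    // Number of bytes in the UTF-8 encoding of s.
    static int utf8Bytes(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    // Number of Unicode code points (not UTF-16 char units).
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    // Number of user-perceived characters, as segmented by BreakIterator.
    static int graphemes(String s) {
        BreakIterator it = BreakIterator.getCharacterInstance();
        it.setText(s);
        int count = 0;
        // Each call to next() advances to the following character boundary.
        while (it.next() != BreakIterator.DONE) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        // 'e' followed by U+0301 COMBINING ACUTE ACCENT
        String s = "e\u0301";
        System.out.println(utf8Bytes(s) + " bytes, "
                + codePoints(s) + " code points, "
                + graphemes(s) + " grapheme(s)");
    }
}
```

For the sample string, the three levels give three different answers: the accent is a separate code point that takes two bytes in UTF-8, yet the reader sees a single character. Which count is the right one depends on whether you are sizing a database column, validating a length limit, or truncating text for display.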

Java and .NET both have full Unicode support, which means you can skip the annoying workarounds in this chapter and work directly with strings inside the languages. However, even with native Unicode strings, you'll always need to ensure that the data you receive from the outside world is valid in your chosen encoding. The default behavior for your language may be to throw an error when you attempt to manipulate a badly encoded string, so it's important either to filter strings at the input boundary or to be ready to catch exceptions deep inside your application. It's well worth picking up a book specific to using Unicode strings with your language of choice.
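As one way to filter at the input boundary, the following Java sketch decodes incoming bytes strictly and rejects anything that is not well-formed UTF-8. The class name Utf8Boundary and its methods are hypothetical names for this example; the decoder calls are from the standard java.nio.charset package. Treat it as a starting point under those assumptions rather than a complete input filter.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper: validates raw input bytes before they become strings.
final class Utf8Boundary {

    // Decodes raw as UTF-8, throwing instead of silently repairing bad input.
    static String decodeStrict(byte[] raw) throws CharacterCodingException {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        return decoder.decode(ByteBuffer.wrap(raw)).toString();
    }

    // Returns true only if raw is well-formed UTF-8.
    static boolean isValidUtf8(byte[] raw) {
        try {
            decodeStrict(raw);
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        byte[] good = "h\u00e9llo".getBytes(StandardCharsets.UTF_8);
        byte[] bad = {(byte) 0xC3, (byte) 0x28}; // lead byte without a valid continuation
        System.out.println(isValidUtf8(good));
        System.out.println(isValidUtf8(bad));
    }
}
```

The explicit REPORT setting matters here. Constructing a string with new String(bytes, StandardCharsets.UTF_8) does not throw on malformed input; it substitutes the replacement character U+FFFD, which hides the problem rather than surfacing it at the boundary.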