
4.9. Using UTF-8 with Email

If your application sends out email, then that email will need to support the character set and encoding used by the application itself. Otherwise, you'd be in a situation where a user can register using a Cyrillic name, but you can't include a greeting to that user in any email you send.

Specifying the character set and encoding to be used with an outgoing email is very similar to specifying them for a web page. Every email has one or more blocks of headers, in a format similar to HTTP headers, describing various things about the mail: the recipient, the time, the subject, and so on. Character sets and encodings are specified through the content-type header, as with HTTP responses:

Content-Type: text/plain; charset=utf-8

The problem with the content-type header is that it describes only the contents of the email body. As with HTTP, email headers must be pure ASCII: many mail transport agents are not 8-bit safe and so will strip characters outside of the ASCII range. If we want to put any string data into the headers, such as subjects or sender names, then we have to do it using ASCII.

Clearly, this is madness: you have lovely UTF-8 data and you want to use it in your email subject lines. Luckily, there's a fairly simple solution. Headers can include something defined in RFC 1342 ("Representation of Non-ASCII Text in Internet Message Headers", since superseded by RFC 2047) as an encoded word. An encoded word looks like this:

=?utf-8?Q?hello_=E2=98=BA?= 
=?charset?encoding?encoded-text?=

The charset element contains the character set name, and the encoding element is either "B" or "Q". The encoded text is the string in the specified character set, encoded using the specified method.

The "B" encoding is straightforward base64, as defined in RFC 3548. The "Q" encoding is a variation on quoted-printable, with the following rules:

  • Any byte can be represented as a literal equal sign (=) followed by two hexadecimal digits. For example, the byte 0x8A can be represented by the sequence =8A.

  • Spaces (byte 0x20) must be replaced with the literal underscore ( _, byte 0x5F).

  • ASCII alphanumeric characters can be left as is.
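
To see these rules from the receiving side, here's a minimal sketch of decoding a single encoded word. The function name decode_encoded_word is our own invention, and the sketch assumes a well-formed word rather than covering every corner of the RFC:

```php
<?php
// Hypothetical helper: decode one encoded word into raw bytes.
// Returns array(charset, bytes), or null if the input is not
// shaped like =?charset?encoding?encoded-text?=
function decode_encoded_word($word){
    if (!preg_match('/^=\?([^?]+)\?([BQ])\?([^?]*)\?=$/i', $word, $m)){
        return null;
    }
    list(, $charset, $encoding, $encoded) = $m;
    if (strtoupper($encoding) === 'B'){
        // "B": plain base64
        $bytes = base64_decode($encoded);
    } else {
        // "Q": underscores stand for spaces, =XX is a hex-encoded byte
        $bytes = quoted_printable_decode(str_replace('_', ' ', $encoded));
    }
    return array($charset, $bytes);
}

list($charset, $bytes) = decode_encoded_word('=?utf-8?Q?hello_=E2=98=BA?=');
echo "$charset: $bytes\n";
```

On a UTF-8 terminal, this should print the charset name followed by the word "hello" and a smiley face.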

The quoted-printable "Q" method is usually preferred because simple ASCII strings are still recognizable. This can aid debugging greatly, since you can read the raw headers of a mail on an ASCII terminal and mostly understand them.
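
For comparison, here's a rough sketch of the "B" form of the same example string. The helper name email_escape_b is hypothetical; it only wraps PHP's built-in base64_encode():

```php
<?php
// Hypothetical helper: wrap a UTF-8 string as a "B" (base64) encoded word
function email_escape_b($text){
    return '=?utf-8?B?' . base64_encode($text) . '?=';
}

// "hello" plus a smiley, written as raw UTF-8 bytes so the
// encoding of this source file does not matter
$text = "hello \xE2\x98\xBA";

echo email_escape_b($text), "\n";          // =?utf-8?B?aGVsbG8g4pi6?=
echo '=?utf-8?Q?hello_=E2=98=BA?=', "\n";  // the "Q" form, for comparison
```

The word "hello" survives in the "Q" form but is unrecognizable in the "B" form.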

The "Q" encoding can be accomplished with a small PHP function:

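function email_escape($text){
    // Hex-encode every byte except ASCII letters and spaces
    // (the /e regex modifier no longer exists in PHP 7+, so use a callback)
    $text = preg_replace_callback('/[^a-z ]/i', function($m){
        return sprintf("=%02X", ord($m[0]));
    }, $text);
    // The "Q" encoding represents spaces as underscores
    $text = str_replace(' ', '_', $text);
    return "=?utf-8?Q?$text?=";
}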

We can make a small improvement to this, though: we only need to escape strings that contain more than the basic characters. We save a couple of bytes for each email sent out and make the source more generally readable:

function email_escape($text){
    // Only encode when the string holds something other than letters and spaces
    if (preg_match('/[^a-z ]/i', $text)){
        $text = preg_replace_callback('/[^a-z ]/i', function($m){
            return sprintf("=%02X", ord($m[0]));
        }, $text);
        $text = str_replace(' ', '_', $text);
        return "=?utf-8?Q?$text?=";
    }
    return $text;
}

RFC 1342 states that no individual encoded word should be longer than 75 characters; to make our function fully compliant, we need some further modifications. Since we know each encoded word needs 12 characters of extra fluff (go on, count them), we can split up our encoded text into blocks of 63 characters or less, wrapping each with the prefix and postfix, with a new line followed by a space between each. Of course, we'll need to be careful not to split an encoded character down the middle, whether that's a single =XX sequence or the several bytes of one UTF-8 character. Implementing the full function is left as an exercise for the reader.
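
If you'd like a head start on that exercise, here's one possible sketch. The name email_escape_folded is our own, the input is assumed to be valid UTF-8, and joining the words with a CRLF plus a space is an assumption about how the caller wants the header folded:

```php
<?php
// Hypothetical helper: Q-encode $text as one or more encoded words,
// each at most 75 characters long. Assumes $text is valid UTF-8.
function email_escape_folded($text){
    // Plain letters and spaces need no encoding at all
    if (!preg_match('/[^a-z ]/i', $text)){
        return $text;
    }
    $max = 75 - 12;   // room left after "=?utf-8?Q?" and "?="
    // Split into whole UTF-8 characters so none is cut in half
    $chars = preg_split('//u', $text, -1, PREG_SPLIT_NO_EMPTY);
    $words = array();
    $current = '';
    foreach ($chars as $char){
        if ($char === ' '){
            $enc = '_';
        } elseif (preg_match('/^[a-z]$/i', $char)){
            $enc = $char;
        } else {
            // Hex-encode each byte of the character
            $enc = '';
            foreach (str_split($char) as $byte){
                $enc .= sprintf('=%02X', ord($byte));
            }
        }
        // Start a new encoded word if this character would not fit
        if ($current !== '' && strlen($current) + strlen($enc) > $max){
            $words[] = "=?utf-8?Q?$current?=";
            $current = '';
        }
        $current .= $enc;
    }
    $words[] = "=?utf-8?Q?$current?=";
    return implode("\r\n ", $words);
}

// Thirty Cyrillic letters, written as raw bytes
echo email_escape_folded(str_repeat("\xD0\x96", 30)), "\n";
```

Because the text is encoded one whole character at a time, a block boundary can never fall inside a =XX sequence or between the bytes of a multibyte character.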

We've talked about both body and header encoding, so all that's left is to bundle up what we've learned into a single function for safely sending UTF-8 email:

function email_send($to_name, $to_email, $subject, $message, $from_name, $from_email){
    // Header values must be ASCII, so encode the display names and subject
    $from_name = email_escape($from_name);
    $to_name   = email_escape($to_name);
    $subject   = email_escape($subject);

    // Encoded words are not allowed inside quoted strings, so the names are left unquoted
    $headers  = "To: $to_name <$to_email>\r\n";
    $headers .= "From: $from_name <$from_email>\r\n";
    $headers .= "Reply-To: $from_email\r\n";
    // The body itself is sent as UTF-8 text
    $headers .= "Content-Type: text/plain; charset=utf-8";

    mail($to_email, $subject, $message, $headers);
}
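
To check what actually goes over the wire without sending any mail, the header-building step can be pulled out on its own. This is only a sketch: email_build_headers is a hypothetical helper, email_escape is repeated so the example runs by itself, and the display name is left unquoted because RFC 2047 does not allow encoded words inside quoted strings:

```php
<?php
// Same Q-encoding helper as in the text, repeated so the sketch runs alone
function email_escape($text){
    if (preg_match('/[^a-z ]/i', $text)){
        $text = preg_replace_callback('/[^a-z ]/i', function($m){
            return sprintf('=%02X', ord($m[0]));
        }, $text);
        $text = str_replace(' ', '_', $text);
        return "=?utf-8?Q?$text?=";
    }
    return $text;
}

// Hypothetical helper: build the extra headers without sending anything
function email_build_headers($from_name, $from_email){
    $from_name = email_escape($from_name);
    $headers  = "From: $from_name <$from_email>\r\n";
    $headers .= "Reply-To: $from_email\r\n";
    $headers .= "Content-Type: text/plain; charset=utf-8";
    return $headers;
}

// A four-letter Cyrillic name, written as raw UTF-8 bytes
echo email_build_headers("\xD0\x98\xD0\xB2\xD0\xB0\xD0\xBD", 'ivan@example.com'), "\n";
```

Every byte of the printed headers is plain ASCII, even though the sender's name is not.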

