UTF-8 URIs, mod_rewrite, and accents

Nov 6, 2008

Special characters (i.e. international characters with accents) weren’t showing up on our site, so I had to do this to fix it:

  1. Replace this meta tag: <meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
    With this meta tag: <meta http-equiv="Content-Type" content="text/html; charset=utf-8">
  2. Add this to index.php
    header('Content-Type: text/html; charset=utf-8');
  3. Add this to .htaccess
    AddDefaultCharset UTF-8
    AddCharset UTF-8 .tpl
    AddCharset UTF-8 .js
    AddCharset UTF-8 .css
    AddCharset UTF-8 .php
  4. Change the database’s encoding from latin1 to UTF-8
  5. Change various tables’ encoding from latin1 to UTF-8
  6. Modify dbConnect.class.php so that it automatically runs these two queries EVERY TIME it creates a connection:
    1. SET CHARACTER SET utf8;
    2. SET NAMES utf8 COLLATE 'utf8_general_ci';
  7. Run a 1-time query to reset any incorrect database encodings:
    SET NAMES 'utf8' COLLATE 'utf8_general_ci';
    SET character_set_client = 'utf8';
    SET character_set_connection = 'utf8';
    SET character_set_database = 'utf8';
    SET character_set_results = 'utf8';
    SET character_set_server = 'utf8';
    SET collation_connection = 'utf8_general_ci';
    SET collation_database = 'utf8_general_ci';
    SET collation_server = 'utf8_general_ci';
    To double-check that this worked correctly, run this query: SHOW VARIABLES LIKE 'c%';
  8. Modify the MySQL config file:
    default-character-set=utf8
    default-collation=utf8_general_ci
    init-connect='SET NAMES utf8'
    character_set_server=utf8
    character_set_client=utf8
    collation_server=utf8_general_ci
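
Why so many layers? Every hop (page, PHP header, Apache, connection, tables) has to agree on the encoding, or the accents get garbled somewhere along the way. Here is a minimal Python sketch of that failure mode, not taken from our code: it assumes one layer writes UTF-8 bytes and the next one reads them as latin1. The variable names are just for illustration.

```python
# Sketch: one layer writes UTF-8, another layer reads the same bytes as latin1.
text = "Café"

# What a UTF-8 layer actually sends: "é" becomes two bytes, 0xC3 0xA9.
utf8_bytes = text.encode("utf-8")

# A latin1 layer treats each byte as one character, so "é" turns into "Ã©".
misread = utf8_bytes.decode("latin-1")

# When both sides agree on UTF-8, the text survives the round trip.
roundtrip = utf8_bytes.decode("utf-8")

print(repr(utf8_bytes))
print(misread)
print(roundtrip)
```

If any single step in the list above is still on latin1, you get the `misread` result for that hop even though everything else is configured correctly.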

I thought that would fix everything, but it didn’t. Some of our URLs started causing Apache to explode, with an unexpected 404 “NOT FOUND” error. This link hated me: http://www.wikiDOMO.com/toronto_on/results/Café

I Googled around for a good 4 hours, trying to find something about mod_rewrite, UTF-8, accented character URIs, internationalization, etc. I found lots, but nothing helped. I even enlisted the help of Chris Hartjes, Julian Simpson, and Jeff Kolesnikowicz, but we all came up empty-handed… so I went back to basics, modifying all the RewriteRules one by one (we have 129 lines of them). Eventually I figured it out.

Accented characters, which UTF-8 encodes as multi-byte sequences, are not part of the a-zA-Z0-9_- character class, so many of our rewrite rules failed to match URLs that contained them.
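
To see the mismatch concretely, here is a small Python sketch. It is only an approximation of what Apache does: it assumes the rule is matched byte by byte against the decoded URL path, and it adds a `+` quantifier to the character class so the pattern can cover a whole path segment. Both are my assumptions for the demo.

```python
import re
from urllib.parse import quote

segment = "Café"

# What a browser typically puts on the wire for this path segment.
encoded = quote(segment)

# The decoded segment as raw bytes: "é" is the two bytes 0xC3 0xA9.
raw = segment.encode("utf-8")

# The old character class, matched byte-wise (the "+" is an assumption).
old_rule = re.compile(rb"[a-zA-Z0-9_-]+")

ascii_ok = old_rule.fullmatch(b"Cafe") is not None
accent_ok = old_rule.fullmatch(raw) is not None

print(encoded)
print(ascii_ok, accent_ok)
```

Neither 0xC3 nor 0xA9 falls in any of the ASCII ranges in the class, so the plain "Cafe" matches and the accented version does not.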

To fix it, I simply had to change them from this: ([a-zA-Z0-9_-])
To this: (.*)

The period means “any character”, and * means “zero or more times”.
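
One thing worth knowing about `.*` is that it also matches slashes and the empty string, so it accepts more than the old class did. The Python sketch below compares it with a tighter `[^/]+` ("one or more characters that are not a slash"). The rule prefix is hypothetical, modeled on the URL above, and `[^/]+` is an alternative I am suggesting, not what we actually deployed.

```python
import re

# Hypothetical rules modeled on the /toronto_on/results/... URL.
greedy = re.compile(r"^toronto_on/results/(.*)$")
one_segment = re.compile(r"^toronto_on/results/([^/]+)$")

path = "toronto_on/results/Café"
deeper = path + "/extra/page"
empty = "toronto_on/results/"

# Both patterns capture the accented search term.
greedy_term = greedy.match(path).group(1)
segment_term = one_segment.match(path).group(1)

# Only ".*" swallows extra path segments or accepts an empty term.
greedy_deeper = greedy.match(deeper).group(1)
segment_deeper = one_segment.match(deeper)
greedy_empty = greedy.match(empty).group(1)
segment_empty = one_segment.match(empty)

print(greedy_term, segment_term)
print(greedy_deeper, segment_deeper)
```

If your rules rely on the order of path segments, the stricter form may be the safer swap, but `.*` is the change that fixed our 404s.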


Comments (2)
  1. Juan
    1:28 pm on November 24th, 2008

    you’re in for a lot of surprises if you continue playing with UTF-8…. watch out for most string functions, usually they’ll have a mb_* equivalent.

    I’ve been “enjoying” utf-8 programming since I joined this company.

  2. Lex
    3:02 pm on April 12th, 2009

    You just saved me your 4 hours looking for a solution. Thanks a lot!
