UTF-8 URIs, mod_rewrite, and accents
Nov/08
Special characters (i.e. international characters with accents) weren’t showing up on our site, so I had to do this to fix it:
- Replace this meta tag:
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
With this meta tag:
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
- Add this to index.php (before any output is sent):
header('Content-Type: text/html; charset=utf-8');
- Add this to .htaccess:
AddDefaultCharset UTF-8
AddCharset UTF-8 .tpl
AddCharset UTF-8 .js
AddCharset UTF-8 .css
AddCharset UTF-8 .php
- Change the database’s encoding from latin1 to UTF-8
- Change various tables’ encoding from latin1 to UTF-8
- Modify dbConnect.class.php so that it automatically runs these two queries EVERY TIME it creates a connection:
  - SET CHARACTER SET utf8;
  - SET NAMES utf8 COLLATE 'utf8_general_ci';
- Run a one-time query to reset any incorrect database encodings (note that a plain SET only changes the current session, so these values do not persist across connections):
SET NAMES 'utf8' COLLATE 'utf8_general_ci';
SET character_set_client = 'utf8';
SET character_set_connection = 'utf8';
SET character_set_database = 'utf8';
SET character_set_results = 'utf8';
SET character_set_server = 'utf8';
SET collation_connection = 'utf8_general_ci';
SET collation_database = 'utf8_general_ci';
SET collation_server = 'utf8_general_ci';
To double-check that this worked correctly, run this query:
SHOW VARIABLES LIKE 'c%';
- Modify the MySQL config file:
default-character-set=utf8
default-collation=utf8_general_ci
init-connect='SET NAMES utf8'
character_set_server=utf8
character_set_client=utf8
collation_server=utf8_general_ci
I thought that would fix everything, but it didn’t. Some of our URLs started causing Apache to explode, with an unexpected 404 “NOT FOUND” error. This link hated me: http://www.wikiDOMO.com/toronto_on/results/Café
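For context, here is roughly what that link looks like on the wire. A browser percent-encodes the accented character as UTF-8 bytes, and Apache decodes the path again before mod_rewrite matches against it. The snippet below is only an illustrative sketch in Python (the site itself is PHP), using the path from the link above:

```python
from urllib.parse import quote, unquote_to_bytes

# Path portion of the failing link from the post.
path = "/toronto_on/results/Café"

# quote() percent-encodes non-ASCII characters as UTF-8 by default
# and leaves "/" untouched.
encoded = quote(path)

# Decoding gives back raw bytes: "é" is the two bytes 0xC3 0xA9,
# neither of which is an ASCII letter.
decoded_bytes = unquote_to_bytes(encoded)

print(encoded)
print(decoded_bytes)
```

The point is that the last path segment is five bytes long, not four, and the final two bytes are outside the ASCII range.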
I Googled around for a good 4 hours, trying to find something about mod_rewrite, UTF-8, accented character URIs, internationalization, etc. I found lots, but nothing helped. I even enlisted the help of Chris Hartjes, Julian Simpson, and Jeff Kolesnikowicz, but we all came up empty-handed… so I went back to basics: modifying all the RewriteRules one by one (we have 129 lines of them). Eventually I figured it out.
Accented characters fall outside the a-zA-Z range (in UTF-8 they are multi-byte sequences), so many of our rewrite rules no longer matched.
To fix it, I simply had to change them from this: ([a-zA-Z0-9_-])
To this: (.*)
A period means “any character”, and * means “zero or more times”.
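To see the difference outside Apache, here is a small sketch in Python (for illustration only; mod_rewrite uses its own regex engine, but these simple patterns behave the same way on raw bytes). The `+` quantifier on the old class and the `[^/]+` pattern are my assumptions, not rules from our actual .htaccess:

```python
import re

# "Café" as the UTF-8 bytes the rewrite rule would be matched against.
segment = "Café".encode("utf-8")

old_rule = re.compile(rb"[a-zA-Z0-9_-]+")  # ASCII-only class (quantifier assumed)
new_rule = re.compile(rb".*")              # the fix described above
tight_rule = re.compile(rb"[^/]+")         # hypothetical stricter alternative


def matches(pattern, value):
    """Return True if the whole value matches the pattern."""
    return pattern.fullmatch(value) is not None


results = {
    "old": matches(old_rule, segment),
    "new": matches(new_rule, segment),
    "tight": matches(tight_rule, segment),
}
print(results)

# .* also swallows slashes, so it can capture more than one path segment.
print(matches(new_rule, b"Cafe/extra"), matches(tight_rule, b"Cafe/extra"))
```

The old class rejects the segment because of the two non-ASCII bytes, while `.*` accepts it. One trade-off worth knowing: `.*` accepts slashes too, so something like `[^/]+` may be a safer choice when a rule is meant to capture a single path segment.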

1:28 pm on November 24th, 2008
you’re in for a lot of surprises if you continue playing with UTF-8…. watch out for most string functions, usually they’ll have a mb_* equivalent.
I’ve been “enjoying” utf-8 programming since I joined this company.
3:02 pm on April 12th, 2009
You just saved me your 4 hours looking for a solution. Thanks a lot!