DerekMartin.ca

I'm a father, manager, programmer, scrum master, geek, & movie lover.

UTF-8 URIs, mod_rewrite, and Accents

I thought that would fix everything, but it didn’t. Some of our URLs started causing Apache to explode with an unexpected 404 “NOT FOUND” error. This link hated me: http://www.wikiDOMO.com/toronto_on/results/Café

I Googled around for a good 4 hours, trying to find something about mod_rewrite, UTF-8, accented character URIs, internationalization, etc. I found lots, but nothing helped. I even enlisted the help of Chris Hartjes, Julian Simpson, and Jeff Kolesnikowicz, but we all came up empty-handed… so I went back to basics, modifying all the RewriteRules one by one (we have 129 lines of them). Eventually I figured it out.

Accented UTF-8 characters are not part of the a-zA-Z character set, so many of our rewrite rules now failed. To fix it, I simply had to change them from this:

([a-zA-Z0-9_-]*)

To this:

(.*)

Period means “any character”, and * means “as many times as you like”.

A few key articles:

- The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)
- Portable php-mysql connection charset fix
- MySQL and UTF-8 at WACT
- Turning MySQL data in Latin1 to UTF-8
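If you want to see the rewrite-rule change in action without touching Apache, here's a rough sketch in Python. It only imitates the pattern matching; it is not how mod_rewrite is implemented. The `results/` prefix, the `capture` helper, and the pattern names are made up for illustration, and the third pattern (`[^/]+`) is not from the fix above. It is just one tighter alternative you could consider, since `(.*)` also swallows slashes.

```python
import re
from urllib.parse import unquote

# Hypothetical patterns modelled on the rules discussed above.
STRICT = re.compile(r"^results/([a-zA-Z0-9_-]*)$")  # the old ASCII-only class
LOOSE = re.compile(r"^results/(.*)$")               # the fix: any character
SEGMENT = re.compile(r"^results/([^/]+)$")          # assumption: a tighter option


def capture(pattern, path):
    """Return the first captured group, or None when the rule does not match."""
    match = pattern.match(path)
    return match.group(1) if match else None


# Browsers send the accent percent-encoded as UTF-8; decode it first.
path = unquote("results/Caf%C3%A9")

print(capture(STRICT, path))   # no match, because of the accented character
print(capture(LOOSE, path))    # matches and captures the accented name
print(capture(SEGMENT, path))  # also matches, but stops at the next slash
```

The trade-off is that `(.*)` will happily match anything, including extra path segments, so it's worth checking whether any later rules relied on the old pattern being strict.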

Comments from my old blog:

Juan said: you’re in for a lot of surprises if you continue playing with UTF-8…. watch out for most string functions, usually they’ll have a mb_* equivalent.

I’ve been “enjoying” utf-8 programming since I joined this company. at 2008-11-24 18:28:25

Lex said: You just saved me your 4 hours looking for a solution. Thanks a lot! at 2009-04-12 15:02:19

Ankzu said: You just made two days worth of headaches disappear :D at 2012-08-24 18:56:23