UTF-8 URIs, mod_rewrite, and Accents

I thought that would fix everything, but it didn’t. Some of our URLs started causing Apache to explode with an unexpected 404 “NOT FOUND” error. A link containing an é hated me.

I Googled around for a good 4 hours, trying to find something about mod_rewrite, UTF-8, accented character URIs, internationalization, etc. I found lots, but nothing helped. I even enlisted the help of Chris Hartjes, Julian Simpson, and Jeff Kolesnikowicz, but we all came up empty-handed… so I went back to basics, modifying all the RewriteRules one by one (we have 129 lines of them). Eventually I figured it out.

Non-ASCII UTF-8 characters are not part of the a-zA-Z character set, so many of our rewrite rules now failed. To fix it, I simply had to change them from this:

([a-zA-Z0-9_-]*)

to this:

(.*)

Period means “any character”, and * means as many times as you like (including zero).

A few key articles:
- The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)
- Portable php-mysql connection charset fix
- MySQL and UTF-8 at WACT
- Turning MySQL data in Latin1 to UTF-8
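To see why the narrow character class stops matching, here is a minimal sketch in Python rather than Apache config. It assumes, as I understand mod_rewrite to work, that the pattern is tested against the URL-decoded path, so an é arrives as the two UTF-8 bytes 0xC3 0xA9. The sample path and the helper name `matches_whole` are made up for illustration, and the `[^/]+` variant is an alternative I'm suggesting, not something from the original rules.

```python
import re

# Hypothetical path segment containing an accented character.
# Encoded as UTF-8, "é" becomes two bytes: b'\xc3\xa9'.
path = "café".encode("utf-8")

# The old rule style: only ASCII letters, digits, underscore and hyphen.
strict = re.compile(rb"([a-zA-Z0-9_-]*)")

# The fix from the post: any character, any number of times.
loose = re.compile(rb"(.*)")

# A possible middle ground (an assumption, not from the post):
# anything except a slash, so the match stays within one path segment.
segment = re.compile(rb"([^/]+)")


def matches_whole(pattern, data):
    """Return True if the pattern matches the entire byte string."""
    return pattern.fullmatch(data) is not None


print(matches_whole(strict, b"cafe"))   # plain ASCII
print(matches_whole(strict, path))      # accented, old rule
print(matches_whole(loose, path))       # accented, new rule
print(matches_whole(segment, path))     # accented, single segment only
```

The trade-off is that `(.*)` also swallows slashes and anything else in the path, so if a rule depends on stopping at a segment boundary, something like `([^/]+)` might be the safer swap.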

Comments from my old blog:

Juan said: you’re in for a lot of surprises if you continue playing with UTF-8…. watch out for most string functions, usually they’ll have a mb_* equivalent.

I’ve been “enjoying” utf-8 programming since I joined this company. at 2008-11-24 18:28:25

Lex said: You just saved me your 4 hours looking for a solution. Thanks a lot! at 2009-04-12 15:02:19

Ankzu said: You just made two days worth of headaches disappear :D at 2012-08-24 18:56:23

I’m Smokin’ Pipes

I’m not even kidding. One Hour. It’s insane. There’s one little bug in it somewhere, having to do with ATOM feeds not having an item.description, but it still works. In fact, you can see it working here:
- as an RSS feed
- as JSON
- as native PHP
- as KML
- embedded in Yahoo’s interface. Click the [LIST] tab.

Of Things and Stuff

