Yahoo Pipes, Google App Engine, and Unicode
Posted by: introspect in education, misadventures, software

This is a follow-on from my previous post. I mentioned how I wrote my own app after giving up on Yahoo Pipes’ Unicode quirks. Well, I think I found the solution. It’s partly Yahoo’s fault, and partly me not reading some documentation.
I consider myself a programming novice in general, not just in Python, so bear with me.
On Yahoo’s side, Pipes handles UTF-8 inconsistently. The pipe output mixes properly decoded characters with wrong ones, where each byte of a multi-byte sequence has been emitted as-is, as a character of its own.
Here’s a common example that I came across. Consider the byte sequence 0xE28099, typically written in Python as "\xe2\x80\x99". When interpreted as UTF-8, it decodes to the single character U+2019, written in Python as u'\u2019'. The u prefix indicates a unicode string, and the escape sequence basically says Unicode code point 2019 (in hexadecimal). If you go to print this out, it is the right single quotation mark, i.e. an apostrophe.
But rather than an apostrophe, the pipe output contains "\xe2\x80\x99" as three separate characters inside a Unicode string. So all over the place I see things like "I\xe2\x80\x99m" and "haven\xe2\x80\x99t". Raw UTF-8 bytes embedded in UTF-8 text. Fun stuff. To my knowledge, there’s no operator within Pipes that could fix something like this, although there might be a regex that could do the trick. I don’t know the first thing about regular expressions though.
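To make the byte-to-character relationship concrete, here is a minimal sketch. It assumes Python 3 rather than the Python 2 used in the rest of this post, so the b prefix marks the byte string and plain strings are already Unicode.

```python
import unicodedata

raw = b"\xe2\x80\x99"          # the three bytes from the example above
ch = raw.decode("utf-8")       # interpret them as one UTF-8 sequence

print(len(raw), len(ch))       # 3 bytes become 1 character
print(hex(ord(ch)))            # 0x2019
print(unicodedata.name(ch))    # RIGHT SINGLE QUOTATION MARK
```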
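One plausible way to end up with this kind of text is to decode UTF-8 bytes one byte per character. The sketch below (Python 3) reproduces the symptom that way. It is only an assumption about the mechanism, since I have no idea what Pipes actually does internally.

```python
good = "I\u2019m"                        # what the text should look like

utf8_bytes = good.encode("utf-8")        # b'I\xe2\x80\x99m'
# Assumption: something decodes each byte as its own character.
# Latin-1 maps byte N to code point N, so it models that exactly.
broken = utf8_bytes.decode("latin-1")

print(len(good), len(broken))            # 3 characters versus 5
print(ascii(broken))                     # 'I\xe2\x80\x99m'
```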
Not too long before, while looking for articles on Google App Engine, I came across a post on using GAE for the Web Service operator. Free hosting for Pipes filters? Sign me up!
Yahoo Pipes posts its data to the address in the Web Service operator, in the form of a UTF-8 encoded JSON string. It expects either a JSON string or RSS/XML document in return, but it’s probably more convenient to return JSON since you’re working with it. Plus, I did try the RSS/XML option once but Pipes output just showed a zero. I didn’t feel like pursuing the problem.
So here is the handler that services the Yahoo Pipes request:
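from django.utils import simplejson
from google.appengine.ext import webapp

class PipesScrubber(webapp.RequestHandler):
    def post(self):
        # Pipes sends the feed as a JSON string in the 'data' form field.
        data = self.request.get('data')
        cleaned = tidy_unicode(data)
        posts = simplejson.loads(cleaned)['items']
        self.response.headers['Content-Type'] = 'application/json'
        # ensure_ascii=False writes non-ASCII characters as-is, not as \uXXXX.
        simplejson.dump(posts, self.response.out, ensure_ascii=False)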
The simplejson library is bundled with django which is bundled with GAE so you don’t have to upload your own. What I neglected to do the first time I tackled this problem was to set that ensure_ascii argument to False. Looking back, it would be the obvious thing to do because if I got UTF-8 JSON, I should return UTF-8 JSON.
But by default, simplejson generates ASCII-only output, escaping every non-ASCII character as a \uXXXX sequence. I was baffled by this for the longest time, because the solution seemed sound but the Pipes output looked just as strange because of all the escaping. Turns out I should have read the simplejson documentation a bit more thoroughly.
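The difference is easy to see in isolation. This sketch uses the standard library json module in Python 3 as a stand-in for simplejson (it takes the same ensure_ascii argument). The items list and its title field are made up for illustration.

```python
import json

# Hypothetical feed item, just for illustration.
items = [{"title": "I\u2019m here"}]

escaped = json.dumps(items)                      # default: ensure_ascii=True
literal = json.dumps(items, ensure_ascii=False)  # keep the character as-is

print(escaped)   # [{"title": "I\u2019m here"}]  (a six-character escape)
print(literal)   # the apostrophe appears as the real character

# Both forms parse back to the same data.
print(json.loads(escaped) == json.loads(literal))   # True
```

Both outputs are valid JSON, so a strict JSON consumer should accept either one. In my case the escaped form is what showed up as junk in the pipe output.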
Here is the tidy_unicode function that does all of the heavy lifting:
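import StringIO
import array

def tidy_unicode(raw):
    # Walk the string one character at a time and re-decode any run that
    # looks like a raw UTF-8 byte sequence. This is Python 2 code.
    ret = StringIO.StringIO()
    lim = len(raw)
    i = 0
    while i < lim:
        try:
            val = ord(raw[i])
            if val < 0x80 or val > 0xF4:
                # ASCII, or a code point above the UTF-8 lead-byte range
                # (for example an already-decoded character): copy as-is.
                ret.write(raw[i])
                i += 1
            elif 0xC0 <= val <= 0xDF:
                # Lead byte of a 2-byte sequence.
                s = array.array('B', [ord(c) for c in raw[i:i+2]]).tostring()
                ret.write(s.decode('utf_8'))
                i += 2
            elif 0xE0 <= val <= 0xEF:
                # Lead byte of a 3-byte sequence.
                s = array.array('B', [ord(c) for c in raw[i:i+3]]).tostring()
                ret.write(s.decode('utf_8'))
                i += 3
            else:
                # Everything else (0xF0-0xF4, but also stray 0x80-0xBF
                # values) is treated as the start of a 4-byte sequence.
                s = array.array('B', [ord(c) for c in raw[i:i+4]]).tostring()
                ret.write(s.decode('utf_8'))
                i += 4
        except UnicodeDecodeError:
            # Debug aid: surface the byte values that failed to decode.
            raise NameError(array.array('B', [ord(c) for c in raw[i:i+4]]))
    return ret.getvalue()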
The try/except block is debug code left over from when I was hitting my head against my desk. It doesn't hurt to leave it in, as the JSON strings aren't super long, so performance isn't much of a concern anyway. Which is good, because this code is quite horrible, like trying to write C in Python and expecting the same performance. But I couldn't think of any other way (always open to suggestions though). After all, the problem is that raw UTF-8 byte sequences are being embedded in an otherwise fine UTF-8 string.
As such, you can't just do something like utf_8_str.decode('utf_8') because it would treat each "raw" byte as ASCII, find that the leading byte is greater than 127, and then blow up. Instead, the byte sequences have to be isolated and converted to a regular byte string, which can then be interpreted as UTF-8. Complicating matters a tad is that UTF-8 byte sequences are variable length, so range checks on the leading byte are needed to work out how many bytes each sequence spans.
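Here is a rough Python 3 analogue of that problem. The error differs from the Python 2 one described above, but the lesson is the same: converting the whole string at once fails when good and broken characters are mixed, while converting only the isolated run works. The sample string is made up.

```python
# Properly decoded curly quotes next to a broken apostrophe.
mixed = "\u201cOK\u201d but I\xe2\x80\x99m broken"

# Whole-string approach: turn every character back into one byte.
try:
    mixed.encode("latin-1")
    whole_string_ok = True
except UnicodeEncodeError:
    # The real curly quotes do not fit in a single byte.
    whole_string_ok = False

# Isolated approach: 0xE2 is a 3-byte lead, so take exactly 3 characters.
start = mixed.index("\xe2")
run = mixed[start:start + 3]
fixed = run.encode("latin-1").decode("utf-8")
repaired = mixed[:start] + fixed + mixed[start + 3:]

print(whole_string_ok)   # False
print(ascii(repaired))   # '\u201cOK\u201d but I\u2019m broken'
```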
One thing that could be changed right off the bat, besides removing the try/except block, is the use of generator expressions instead of list comprehensions when doing the byte sequence conversion. Might be a bit more memory efficient, but on GAE memory is dirt cheap, and this app isn't using gobs of it anyway.
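For what it's worth, the same idea can probably be expressed with a regular expression. The following is a sketch in Python 3, not a drop-in replacement for the function above: tidy_text and _UTF8_RUN are names made up here, and the pattern only checks the lead byte and the number of continuation bytes, leaving the real validation to the decoder.

```python
import re

# A lead "byte" followed by the right number of continuation "bytes",
# all stored as characters in the range U+0080 to U+00F4.
_UTF8_RUN = re.compile(
    "[\u00c2-\u00df][\u0080-\u00bf]"         # 2-byte sequence
    "|[\u00e0-\u00ef][\u0080-\u00bf]{2}"     # 3-byte sequence
    "|[\u00f0-\u00f4][\u0080-\u00bf]{3}"     # 4-byte sequence
)

def tidy_text(text):
    """Re-decode runs of characters that look like raw UTF-8 bytes."""
    def _fix(match):
        run = match.group(0)
        try:
            # Latin-1 turns each character back into the byte it came from.
            return run.encode("latin-1").decode("utf-8")
        except UnicodeDecodeError:
            # Not valid UTF-8 after all, so leave the run untouched.
            return run
    return _UTF8_RUN.sub(_fix, text)

print(ascii(tidy_text("I\xe2\x80\x99m")))       # 'I\u2019m'
print(ascii(tidy_text("haven\xe2\x80\x99t")))   # 'haven\u2019t'
```

The usual caveat applies: a legitimate character that happens to be followed by characters in the U+0080 to U+00BF range would be misread as a broken sequence. That should be rare in ordinary text, but it is a guess, not a guarantee.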
There you have it. Hope it was useful, or educational at least.
Tags: education, misadventures, python, software