2014年5月12日星期一

Edd Mann: Reversing a Unicode String in PHP using UTF-16BE/LE


Edd Mann looks at an issue in his latest post that caused him problems in a recent project, reversing a Unicode string with UTF-16BE/LE.



Last week I was bit by the Unicode encoding issue when trying to naively manipulate a user's input using PHP's built-in string functions. PHP simply assumes that all characters are a single byte (octet) and the provided functions use this assumption when processing a string. [...] You should be aware that in 'Western Europe' we commonly only use the basic ASCII character-set (consisting of 7 bytes). This makes the transition to the popular 'UTF-8' Unicode representation almost seamless, as the two map one-to-one. I wish to however, discuss how to reverse a Unicode string (UTF-8) using a combination of endianness magic and the 'strrev' function.


He provides two different approaches to the problem. The first he calls the "naive" approach because it corrupts characters needing more than the two-byte representation. His second solution, the "endianness" method, converts the string to big-endian first (UTF-16) and then back to UTF-8 for more correct handling.


Link: http://eddmann.com/posts/reversing-a-unicode-string-in-php-using-utf-16-be-le

没有评论:

发表评论