PHP UTF-8 String Length

If you’ve got a UTF-8 encoded PHP string (e.g. when working with DOMDocument) and you don’t want to rely on the mbstring extension to get its length, this can be solved with a simple regular expression (as long as the string is known to be valid and does not need to be validated):

$pattern = '([\x00-\x7F]|[\xC0-\xFF][\x80-\xBF]+)';
$length = preg_match_all($pattern, $subject, $matches);
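To see why this counts characters rather than bytes: each match is either a single ASCII byte or a lead byte followed by its continuation bytes. A minimal sketch, assuming a hand-encoded sample string (the sample value is only an illustration, not from the original post):

```php
<?php
// "héllo €" written as raw UTF-8 bytes: 7 characters, 10 bytes.
$subject = "h\xC3\xA9llo \xE2\x82\xAC";

// No u-modifier here, so PCRE works on single bytes:
// one ASCII byte, or one lead byte plus its continuation bytes.
$pattern = '([\x00-\x7F]|[\xC0-\xFF][\x80-\xBF]+)';
$length = preg_match_all($pattern, $subject, $matches);

var_dump(strlen($subject)); // int(10), the byte count
var_dump($length);          // int(7), the character count
```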

If the u-modifier (PCRE_UTF8) is available, this can be shortened to:

$pattern = '(.)su';
$length = preg_match_all($pattern, $subject, $matches);

And since PHP 5.4 the $matches parameter is optional for preg_match_all, which should benefit memory usage, as the function then only needs to count the matches:

$length = preg_match_all('(.)su', $subject);

Which makes me think that preg_match_all is just not the tool I’m looking for, et voilà:

preg_filter('(.)su', '', $subject, -1, $length);

The preg_filter function avoids the overhead of needlessly passing matches around. In PHP < 5.4 this is (relatively) much faster than using preg_match_all with the $matches parameter. In PHP >= 5.4, leaving $matches out of preg_match_all is faster still (but not by much).
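Wrapped into a function, the preg_filter variant might look like the following. This is only a sketch: the name utf8_length is hypothetical and not part of PHP or any library, and it assumes $subject is valid UTF-8.

```php
<?php
// Hypothetical helper name; assumes $subject is valid UTF-8.
function utf8_length($subject)
{
    $length = 0;
    // The filtered string is discarded; only the by-reference
    // replacement count (fifth argument) is of interest.
    preg_filter('(.)su', '', $subject, -1, $length);
    return $length;
}

$subject = "na\xC3\xAFve"; // "naïve": 5 characters, 6 bytes

var_dump(utf8_length($subject)); // int(5)
var_dump(strlen($subject));      // int(6)
```

The s-modifier matters here: without it, the dot would not match newlines and multi-line strings would be undercounted.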

If $subject contains invalid byte sequences, there is more work to do. The php-utf8 library has more to offer if you cannot be sure that the mbstring extension is available (or if mb_strlen cannot be trusted).
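As a rough illustration of what happens with bad input: with the u-modifier, PCRE rejects an invalid subject outright, while the byte-level pattern silently skips stray bytes. A minimal sketch, in which the function name utf8_length_checked and the sample strings are assumptions for illustration only:

```php
<?php
// Hypothetical helper: returns the character count, or false
// when PCRE rejects $subject (e.g. because it is not valid UTF-8).
function utf8_length_checked($subject)
{
    $length = preg_match_all('(.)su', $subject, $matches);
    if ($length === false) {
        // preg_last_error() reports PREG_BAD_UTF8_ERROR for invalid UTF-8.
        return false;
    }
    return $length;
}

$bad = "ab\xFF"; // 0xFF never occurs in valid UTF-8

var_dump(utf8_length_checked("abc")); // int(3)
var_dump(utf8_length_checked($bad));  // bool(false)

// The byte-level pattern does not fail; it just skips the stray byte.
$bytePattern = '([\x00-\x7F]|[\xC0-\xFF][\x80-\xBF]+)';
var_dump(preg_match_all($bytePattern, $bad, $m)); // int(2)
```

A false return only signals that something is wrong; repairing or replacing the bad sequences is the extra work a dedicated library would take on.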

Edit: There is also Patchwork UTF-8 by Nicolas Grekas. I just looked in there and it’s pretty complete. Handle with care: it contains a lot of the Unicode database, which generally looks very promising (but is large).
