If you’ve got a UTF-8 encoded PHP string (e.g. when working with DOMDocument) and you don’t want to rely on the mbstring extension to get its length in characters, this can be solved with a simple regular expression (as the string does not need to be validated):
$pattern = '([\x00-\x7F]|[\xC0-\xFF][\x80-\xBF]+)'; $length = preg_match_all($pattern, $subject, $matches);
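To make the byte-level pattern easier to try out, here is a minimal sketch that wraps it in a function. The name utf8_length_bytes is made up for this example, and the code assumes the input is already well-formed UTF-8, since the pattern only groups a lead byte with its continuation bytes and validates nothing:

```php
<?php
// Hypothetical helper (the name is not from any library).
// Counts UTF-8 characters without the u modifier and without mbstring.
function utf8_length_bytes($subject)
{
    // One match is either a single ASCII byte, or a lead byte
    // followed by one or more continuation bytes.
    $pattern = '([\x00-\x7F]|[\xC0-\xFF][\x80-\xBF]+)';

    // $matches is passed explicitly so this also works before PHP 5.4.
    return preg_match_all($pattern, $subject, $matches);
}

// "héllo" is 6 bytes but 5 characters; "\xC3\xA9" is the UTF-8 encoding of "é".
echo strlen("h\xC3\xA9llo"), "\n";            // 6
echo utf8_length_bytes("h\xC3\xA9llo"), "\n"; // 5
```

Writing the test string with byte escapes keeps the example independent of the encoding the source file is saved in.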
If the u-modifier (PCRE_UTF8) is available, this can be shortened to:
$pattern = '(.)su'; $length = preg_match_all($pattern, $subject, $matches);
And since PHP 5.4 the $matches parameter is optional for preg_match_all, which should benefit memory usage as this only needs to count matches:
$length = preg_match_all('(.)su', $subject);
Which makes me think that preg_match_all is just not the tool I’m looking for, et voilà:
preg_filter('(.)su', '', $subject, -1, $length);
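As a rough sketch, the preg_filter call can be wrapped like this. The function name utf8_length_filter is an assumption for illustration, and the code presumes valid UTF-8 input and a PCRE build with UTF-8 support:

```php
<?php
// Hypothetical helper (the name is not from any library).
function utf8_length_filter($subject)
{
    $length = 0;

    // Every character is replaced with an empty string; the result is
    // discarded and only the replacement count (5th argument) is kept.
    preg_filter('(.)su', '', $subject, -1, $length);

    return $length;
}

echo utf8_length_filter("h\xC3\xA9llo"), "\n"; // 5
```

The s modifier matters here: without it the dot would skip newlines and the count would come out too low for multi-line strings.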
The preg_filter function avoids the overhead of needlessly passing matches around. This is (relatively) much faster in PHP < 5.4 than using preg_match_all with the $matches parameter. In PHP >= 5.4, leaving $matches out is faster again (but not by much).
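If you want to check the timing claim on your own setup, a crude micro-benchmark along these lines could do. The test string and the number of rounds are arbitrary choices for this sketch, and the numbers will vary with PHP and PCRE versions:

```php
<?php
// Arbitrary test data: 1000 repetitions of a 12-character UTF-8 string.
$subject = str_repeat("h\xC3\xA9llo w\xC3\xB6rld ", 1000);
$rounds  = 200;

// Variant 1: preg_match_all collecting all matches.
$start = microtime(true);
for ($i = 0; $i < $rounds; $i++) {
    $viaMatchAll = preg_match_all('(.)su', $subject, $matches);
}
$timeMatchAll = microtime(true) - $start;

// Variant 2: preg_filter, keeping only the replacement count.
$start = microtime(true);
for ($i = 0; $i < $rounds; $i++) {
    preg_filter('(.)su', '', $subject, -1, $viaFilter);
}
$timeFilter = microtime(true) - $start;

printf("preg_match_all with \$matches: %.4fs\n", $timeMatchAll);
printf("preg_filter:                  %.4fs\n", $timeFilter);
```

On PHP 5.4 or later, a third loop calling preg_match_all('(.)su', $subject) without $matches can be added in the same way.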
If $subject contains bad sequences, there is more work to do. The php-utf8 library has more to offer if the mbstring extension is not guaranteed to be available (or mb_strlen cannot be trusted).
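For a first line of defence against bad sequences, one option is to let PCRE do the validation: with the u modifier, matching fails with an error on malformed UTF-8. The following is only a sketch with a made-up function name (utf8_length_checked), it needs PHP 5.4 or later because $matches is left out, and it is not a substitute for what the libraries mentioned here do:

```php
<?php
// Hypothetical helper (the name is not from any library).
// Returns the character count, or null if $subject is not valid UTF-8.
function utf8_length_checked($subject)
{
    // An empty pattern with the u modifier matches any valid UTF-8 string
    // (including the empty string) and returns false on malformed input.
    if (preg_match('//u', $subject) !== 1) {
        return null;
    }

    return preg_match_all('(.)su', $subject);
}

var_dump(utf8_length_checked("h\xC3\xA9llo")); // int(5)
var_dump(utf8_length_checked("\xC3\x28"));     // NULL: lead byte without a continuation byte
```

Returning null rather than 0 keeps an invalid string distinguishable from an empty one.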
Edit: There is also Patchwork UTF-8 by Nicolas Grekas. I just looked in there and it’s pretty complete. Handle with care: it contains a lot of the Unicode database, which generally looks very promising (but large).