If you’ve got a UTF-8 encoded PHP string (e.g. when working with DOMDocument) and you don’t want to rely on the mbstring extension to get its length, this can be solved with a simple regular expression (as the string does not need to be validated):
$pattern = '([\x00-\x7F]|[\xC0-\xFF][\x80-\xBF]+)'; $length = preg_match_all($pattern, $subject, $matches);
If the u-modifier (PCRE_UTF8) is available, this can be shortened to:
$pattern = '(.)su'; $length = preg_match_all($pattern, $subject, $matches);
And since PHP 5.4 the $matches parameter is optional for preg_match_all, which should benefit memory usage as the function then only needs to count matches:
$length = preg_match_all('(.)su', $subject);
Which makes me think that preg_match_all is just not the tool I’m looking for, et voilà:
preg_filter('(.)su', '', $subject, -1, $length);
The preg_filter function avoids the overhead of needlessly passing around matches. This is (relatively) much faster in PHP < 5.4 than using preg_match_all with the $matches parameter. In PHP >= 5.4, leaving $matches out is faster again (but not by much).
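To see the variants side by side, here is a minimal sketch that wraps each one in a small function and runs them on the same string. The function names (utf8_length_bytes, utf8_length_match, utf8_length_filter) are made up for this example and not part of PHP. The sample string is written with byte escapes so the result does not depend on the encoding of the source file. Note that the two u-modifier variants assume valid UTF-8: on an invalid subject the preg_* functions fail instead of counting.

```php
<?php
// Hypothetical helper names, used only for illustration.

// Byte-level pattern, no u-modifier needed: one ASCII byte,
// or a lead byte followed by its continuation bytes.
function utf8_length_bytes($subject)
{
    $pattern = '([\x00-\x7F]|[\xC0-\xFF][\x80-\xBF]+)';
    return (int) preg_match_all($pattern, $subject, $matches);
}

// u-modifier variant: "." matches one code point, "s" lets it match newlines too.
// $matches is still passed, so this also runs on PHP < 5.4.
function utf8_length_match($subject)
{
    return (int) preg_match_all('(.)su', $subject, $matches);
}

// preg_filter variant: the replacement result is thrown away,
// only the by-reference replacement count is kept.
function utf8_length_filter($subject)
{
    $count = 0;
    preg_filter('(.)su', '', $subject, -1, $count);
    return (int) $count;
}

// "héllo wörld €" as raw UTF-8 bytes: 13 characters in 17 bytes.
$subject = "h\xC3\xA9llo w\xC3\xB6rld \xE2\x82\xAC";

echo strlen($subject), "\n";             // byte count: 17
echo utf8_length_bytes($subject), "\n";  // 13
echo utf8_length_match($subject), "\n";  // 13
echo utf8_length_filter($subject), "\n"; // 13
```

All three agree on valid input. Which one is fastest depends on the PHP version, as described above, so it is worth timing them on your own setup.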
Edit: There is also Patchwork UTF-8 by Nicolas Grekas. I just looked in there and it’s pretty complete. Handle with care: it contains a lot of the Unicode database, which generally looks very promising (but large).