Issues

ZF-11223: Zend_Filter_StringTrim breaks UTF-8 string if last character is uppercased Cyrillic 'Р'

Description

The filter tears away last byte of two-byte Cyrillic symbol 'Р' (U+0420, 0xD0|0xA0 sequence in UTF-8) when it's placed in the end of UTF-8 string (for example, in abbreviation 'ЮАР'), thus making the filtered UTF-8 string invalid.

The last byte (0xA0) corresponds to non-breaking space (U+00A0) in Latin-1, that's the reason why it gets trimmed. I think that trying to perform UTF-8 trimming without complete decoding UTF-8 to Unicode sequence will fail again and again in similar cases.

Comments

I've worked out a solution that works. It's rather a hack than beautiful piece of code, but at least it works.

First, we check for empty string to avoid all the processing if there's nothing to trim from. Second, we add 'u' modifier to the pattern. Now, we're starting to get error in preg_replace sometimes (on Cyrillic string 'Ром', for example). If we face such error, we delegate the task of trimming to a slower procedure.


    protected function _unicodeTrim($value, $charlist = '\\\\s')
    {
        if ('' == $value) {
            return $value;
        }

        $chars = preg_replace(
            array( '/[\^\-\]\\\]/S', '/\\\{4}/S', '/\//'),
            array( '\\\\\\0', '\\', '\/' ),
            $charlist
        );

        $pattern = '^[' . $chars . ']*|[' . $chars . ']*$';
        $result = preg_replace("/$pattern/usSD", '', $value);

        if (null === $result) {
            $result = $this->_slowUnicodeTrim($value, $chars);
        }

        return $result;
    }

In slower method we split the UTF-8 string into an array of single UTF-8 characters (using a simple hack with iconv() - UTF-32 characters are always 4 bytes long). Then we match separate characters against simple pattern that won't generate an internal error and remove them if necessary. In the end we just join the array again and return it.


    protected function _slowUnicodeTrim($value, $chars) {
        $utfChars = $this->_splitUtf8($value);
        $pattern = '/^[' . $chars . ']$/usSD';

        while ($utfChars && preg_match($pattern, $utfChars[0])) {
            array_shift($utfChars);
        }

        while ($utfChars && preg_match($pattern, $utfChars[count($utfChars) - 1])) {
            array_pop($utfChars);
        }

        return implode('', $utfChars);
    }

    protected function _splitUtf8($value)
    {
        $utfChars = str_split(iconv('UTF-8', 'UTF-32BE', $value), 4);
        array_walk($utfChars, create_function('&$char', '$char = iconv("UTF-32BE", "UTF-8", $char);'));
        return $utfChars;
    }

This issue duplicates ZF-10891

This issue has been fixed with GH-107

@thomas some problem in apply this issue to next mini release (merges to branch 1.11) ;)?

Greetings Ramon

Feel free to backport the fix from ZF2 to ZF1.