Closed
Symfony version(s) affected: 5.0
Description
While documenting the new String component in symfony/symfony-docs#12440 I found something which doesn't make sense to me.
A "grapheme cluster" is a series of one or more code points.
Then, why does our "grapheme string" return a single code point (the first one) instead of all of them for a given cluster?
How to reproduce
Take, for example, the Hindi word for "hello": नमस्ते.
use function Symfony\Component\String\b;

// Byte-level view of the UTF-8 string.
$bString = b('नमस्ते');
echo "Bytes: ";
for ($i = 0; $i < $bString->length(); $i++) {
    echo $bString->byteCode($i)." ";
}

// Code-point view: one entry per Unicode code point.
$cString = $bString->toCodePointString();
echo "\nCode Points: ";
for ($i = 0; $i < $cString->length(); $i++) {
    echo $cString->codePoint($i)." ";
}

// Grapheme view: length() counts grapheme clusters, not code points.
$uString = $bString->toUnicodeString();
echo "\nGrapheme Code Points: ";
for ($i = 0; $i < $uString->length(); $i++) {
    echo $uString->codePoint($i)." ";
}
The output is:
Bytes: 224 164 168 224 164 174 224 164 184 224 165 141 224 164 164 224 165 135
Code Points: 2344 2350 2360 2381 2340 2375
Grapheme Code Points: 2344 2350 2360 2340
However, I expected the output to be:
// ...
Grapheme Code Points: 2344 2350 [2360, 2381] [2340, 2375]
Possible Solution
UnicodeString should not return ?int but ?array (the other classes should keep returning ?int).
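To illustrate the proposed return shape, here is a minimal sketch in plain PHP that does not use the String component at all. The function name graphemeCodePointsAt() is hypothetical and is not part of symfony/string. It assumes the mbstring extension is available and relies on PCRE's \X to find grapheme clusters, so the exact cluster boundaries may depend on the Unicode version your PCRE build supports.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper (not part of symfony/string): returns every code point
 * of the grapheme cluster at $offset, or null when the offset is out of range.
 */
function graphemeCodePointsAt(string $utf8, int $offset): ?array
{
    // \X matches one extended grapheme cluster; the /u flag enables UTF-8 mode.
    preg_match_all('/\X/u', $utf8, $matches);
    $cluster = $matches[0][$offset] ?? null;
    if ($cluster === null) {
        return null;
    }

    // Split the cluster into single characters, then map each to its code point.
    $chars = preg_split('//u', $cluster, -1, PREG_SPLIT_NO_EMPTY);

    return array_map(static function (string $char): int {
        return mb_ord($char, 'UTF-8');
    }, $chars);
}

// Usage: print the code points of each grapheme cluster in brackets.
$word = 'नमस्ते';
for ($i = 0; ($codePoints = graphemeCodePointsAt($word, $i)) !== null; $i++) {
    echo '['.implode(', ', $codePoints).'] ';
}
echo "\n";
```

Returning null for an out-of-range offset mirrors the nullable return type suggested above, and a cluster made of a single code point comes back as a one-element array, so callers can always treat the result the same way.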