[String] Should UnicodeString return all code points for a given grapheme? #33923

@javiereguiluz

Symfony version(s) affected: 5.0

Description
While documenting the new String component in symfony/symfony-docs#12440, I found something that doesn't make sense to me.

A "grapheme cluster" is a series of one or more code points.

So why does our "grapheme string" (UnicodeString) return only a single code point (the first one) for a given cluster instead of all of them?
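To make the distinction between bytes, code points, and grapheme clusters concrete, here is a minimal sketch that uses only PHP's built-in `strlen()`, `mb_strlen()`, and PCRE's `\X` escape rather than the String component itself. The sample string ("e" followed by a combining acute accent) is an assumption chosen for illustration; it is not taken from the report.

```php
<?php
// "é" written as two code points: U+0065 (e) + U+0301 (combining acute accent).
$text = "e\u{0301}";

// Three different notions of "length" for the same string.
$bytes      = strlen($text);                   // UTF-8 bytes
$codePoints = mb_strlen($text, 'UTF-8');       // Unicode code points
$graphemes  = preg_match_all('/\X/u', $text);  // grapheme clusters (\X = one cluster)

printf("bytes=%d code points=%d graphemes=%d\n", $bytes, $codePoints, $graphemes);
// prints "bytes=3 code points=2 graphemes=1"
```

One visible character is a single grapheme cluster here, yet it is built from two code points, which is exactly the information that is lost when only the first code point is returned.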

How to reproduce
Take, for example, the Hindi word for "hello", which is नमस्ते.

use function Symfony\Component\String\b;

$bString = b('नमस्ते');

echo "Bytes: ";
for ($i = 0; $i < $bString->length(); $i++) {
    echo $bString->byteCode($i)." ";
}

$cString = $bString->toCodePointString();
echo "Code Points: ";
for ($i = 0; $i < $cString->length(); $i++) {
    echo $cString->codePoint($i)." ";
}

$uString = $bString->toUnicodeString();
echo "Grapheme Code Points: ";
for ($i = 0; $i < $uString->length(); $i++) {
    echo $uString->codePoint($i)." ";
}

The output is:

Bytes: 224 164 168 224 164 174 224 164 184 224 165 141 224 164 164 224 165 135

Code Points: 2344 2350 2360 2381 2340 2375

Grapheme Code Points: 2344 2350 2360 2340

However, I expected the output to be:

// ...

Grapheme Code Points: 2344 2350 [2360, 2381] [2340, 2375]

Possible Solution
UnicodeString::codePoint() should return ?array instead of ?int (the other classes should keep returning ?int).
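The proposed return shape can be sketched outside the component. The function below is a hypothetical helper, not part of symfony/string and not a proposed implementation; it assumes PHP 7.4+ with the `mbstring` extension and relies on PCRE's `\X` to find cluster boundaries.

```php
<?php
/**
 * Hypothetical helper (not part of symfony/string): returns, for every
 * grapheme cluster in $text, the list of code points that cluster contains.
 *
 * @return int[][]
 */
function codePointsPerGrapheme(string $text): array
{
    // \X matches one extended grapheme cluster; the "u" flag enables UTF-8 mode.
    preg_match_all('/\X/u', $text, $matches);

    $result = [];
    foreach ($matches[0] as $cluster) {
        // Split the cluster into single characters, then map each to its code point.
        $result[] = array_map(
            fn (string $char): int => mb_ord($char, 'UTF-8'),
            mb_str_split($cluster, 1, 'UTF-8')
        );
    }

    return $result;
}

echo json_encode(codePointsPerGrapheme("e\u{0301}a")), "\n";
// prints "[[101,769],[97]]"
```

Applied to नमस्ते, the same helper returns all six code points (2344, 2350, 2360, 2381, 2340, 2375) grouped by cluster. How the virama sequences are grouped may depend on the Unicode version your PCRE build implements, so the exact grouping is not asserted here.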
