Clarifying Java's evolutionary support of Unicode


I'm finding Java's differentiation between char and code point strange and out of place.

For example, a string is an array of characters, or "letters that appear in an alphabet"; in contrast, a code point may be a single letter, or possibly a composite character or a surrogate pair. However, Java defines a character of a string as a char, which cannot be a composite or contain a surrogate code point, and also as an int (this is fine).
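To make what I mean concrete, here is a small sketch (U+1F600 below is just an arbitrary character outside the BMP, picked for illustration):

    String s = "\uD83D\uDE00";          // U+1F600, stored as a surrogate pair of two chars
    char unit = s.charAt(0);            // one UTF-16 code unit: 0xD83D, the high surrogate
    int codePoint = s.codePointAt(0);   // the full code point as an int: 0x1F600

    System.out.printf("U+%04X%n", (int) unit);   // U+D83D
    System.out.printf("U+%X%n", codePoint);      // U+1F600
    // a char only holds values up to U+FFFF, so this code point needs two of them:
    System.out.println(Character.charCount(codePoint));   // 2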

But then length() seems to return the number of code points, while codePointCount() also returns the number of code points but instead combines composite characters... which ends up not being the real count of code points?
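For concreteness, here is roughly what the two methods report on a surrogate pair and on a combining sequence (a small sketch; the specific characters are just examples I picked):

    String s = "a\uD83D\uDE00b";                          // 'a', U+1F600 (a surrogate pair), 'b'
    System.out.println(s.length());                       // 4 -- counts UTF-16 code units (chars)
    System.out.println(s.codePointCount(0, s.length()));  // 3 -- the surrogate pair counts as one

    String composite = "e\u0301";                         // 'e' followed by a combining acute accent
    System.out.println(composite.length());                               // 2
    System.out.println(composite.codePointCount(0, composite.length()));  // 2 -- combining marks stay separate code points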

It feels as though charAt() should return a String so that composites and surrogates are brought along, and the result of length() should swap with codePointCount().
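For what it's worth, something close to that can be pieced together today from offsetByCodePoints() and Character.toChars(); the helper below is purely hypothetical and only brings surrogate pairs along, not combining sequences (those would need something like java.text.BreakIterator.getCharacterInstance()):

    // Hypothetical helper: return the i-th *code point* of s as a String,
    // so a surrogate pair comes back as one unit (combining marks are not merged).
    static String codePointAsString(String s, int codePointIndex) {
        int charIndex = s.offsetByCodePoints(0, codePointIndex);  // code point index -> char index
        return new String(Character.toChars(s.codePointAt(charIndex)));
    }

    // codePointAsString("a\uD83D\uDE00b", 1)  ->  the full U+1F600 pair as a two-char String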

The original implementation feels a little backwards. Is there a reason it's designed the way it is?

Update: codePointAt(), codePointBefore()

It's worth noting that codePointAt() and codePointBefore() accept an index parameter; however, the index acts upon chars and has a range of 0 to length() - 1, and is therefore not based on the number of code points in the string, as one might assume.
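To illustrate what that char-based index means in practice (a small sketch; the emoji is again just an arbitrary supplementary character):

    String s = "\uD83D\uDE00a";   // U+1F600 followed by 'a': 3 chars, but only 2 code points
    System.out.printf("U+%X%n", s.codePointAt(0));  // U+1F600 -- index 0 hits the high surrogate, so the whole pair is returned
    System.out.printf("U+%X%n", s.codePointAt(1));  // U+DE00  -- index 1 lands mid-pair and yields an unpaired low surrogate
    System.out.printf("U+%X%n", s.codePointAt(2));  // U+61    -- 'a', the *second* code point, lives at char index 2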

Update: equalsIgnoreCase()

String.equalsIgnoreCase() uses the term normalization to describe what it does prior to comparing strings. This is a misnomer, since normalization in the context of a Unicode string can mean something entirely different. What they mean is that they use case-folding.
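To show the difference (a small sketch using java.text.Normalizer; the precomposed vs. decomposed spellings of "café" are just an example pair):

    import java.text.Normalizer;

    String precomposed = "caf\u00E9";    // 'é' as the single code point U+00E9
    String decomposed  = "cafe\u0301";   // 'e' followed by the combining acute accent U+0301

    // equalsIgnoreCase() compares char by char with per-char case mappings; it does no Unicode normalization:
    System.out.println(precomposed.equalsIgnoreCase(decomposed));   // false

    // actual normalization makes the two canonically equivalent forms compare equal:
    String a = Normalizer.normalize(precomposed, Normalizer.Form.NFC);
    String b = Normalizer.normalize(decomposed,  Normalizer.Form.NFC);
    System.out.println(a.equalsIgnoreCase(b));                      // true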

When Java was created, Unicode didn't have the notion of surrogate characters, and Java decided to represent characters as 16-bit values.
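You can still see that 16-bit decision in the Character API today (a small sketch; the smiley is just an arbitrary supplementary code point):

    // char is a 16-bit UTF-16 code unit, so it tops out at U+FFFF ...
    System.out.printf("U+%04X%n", (int) Character.MAX_VALUE);       // U+FFFF
    // ... while Unicode (and the int-based code point API) goes up to U+10FFFF:
    System.out.printf("U+%X%n", Character.MAX_CODE_POINT);          // U+10FFFF

    // anything above U+FFFF has to be stored as a surrogate pair of two chars:
    int smiley = 0x1F600;
    System.out.println(Character.isSupplementaryCodePoint(smiley)); // true
    System.out.println(Character.charCount(smiley));                // 2
    System.out.println(Character.toChars(smiley).length);           // 2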

I suppose they don't want to break backwards compatibility. There is a lot more information here: http://www.oracle.com/us/technologies/java/supplementary-142654.html

