Support obtaining Script_Extensions of a character #4

@Manishearth

This is needed for mixed script detection.

The easy way to do this is just to store a slice of script_extensions for each code point / range, but there's actually a limited set of ways script_extensions mix (taken from here):

 Adlam (Adlam),
 Adlam,Arabic,Hanifi_Rohingya,Mandaic,Manichaean,Psalter_Pahlavi,Sogdian,Syriac (Adlam,Arabic,Hanifi_Rohingya,Mandaic,Manichaean,Psalter_Pahlavi,Sogdian,Syriac),
 Ahom (Ahom),
 Anatolian_Hieroglyphs (Anatolian_Hieroglyphs),
 Arabic (Arabic),
 Arabic,Coptic (Arabic,Coptic),
 Arabic,Hanifi_Rohingya (Arabic,Hanifi_Rohingya),
 Arabic,Hanifi_Rohingya,Syriac,Thaana (Arabic,Hanifi_Rohingya,Syriac,Thaana),
 Arabic,Syriac (Arabic,Syriac),
 Arabic,Syriac,Thaana (Arabic,Syriac,Thaana),
 Arabic,Thaana (Arabic,Thaana),
 Armenian (Armenian),
 Armenian,Georgian (Armenian,Georgian),
 Avestan (Avestan),
 Balinese (Balinese),
 Bamum (Bamum),
 Bassa_Vah (Bassa_Vah),
 Batak (Batak),
 Bengali (Bengali),
 Bengali,Chakma,Syloti_Nagri (Bengali,Chakma,Syloti_Nagri),
 Bengali,Devanagari (Bengali,Devanagari),
 Bengali,Devanagari,Dogra,Grantha,Gujarati,Gunjala_Gondi,Gurmukhi,Kannada,Khudawadi,Limbu,Mahajani,Malayalam,Masaram_Gondi,Nandinagari,Oriya,Sinhala,Syloti_Nagri,Takri,Tamil,Telugu,Tirhuta (Bengali,Devanagari,Dogra,Grantha,Gujarati,Gunjala_Gondi,Gurmukhi,Kannada,Khudawadi,Limbu,Mahajani,Malayalam,Masaram_Gondi,Nandinagari,Oriya,Sinhala,Syloti_Nagri,Takri,Tamil,Telugu,Tirhuta),
 Bengali,Devanagari,Dogra,Grantha,Gujarati,Gunjala_Gondi,Gurmukhi,Kannada,Khudawadi,Mahajani,Malayalam,Masaram_Gondi,Nandinagari,Oriya,Sinhala,Syloti_Nagri,Takri,Tamil,Telugu,Tirhuta (Bengali,Devanagari,Dogra,Grantha,Gujarati,Gunjala_Gondi,Gurmukhi,Kannada,Khudawadi,Mahajani,Malayalam,Masaram_Gondi,Nandinagari,Oriya,Sinhala,Syloti_Nagri,Takri,Tamil,Telugu,Tirhuta),
 Bengali,Devanagari,Grantha,Gujarati,Gurmukhi,Kannada,Latin,Malayalam,Oriya,Sharada,Tamil,Telugu,Tirhuta (Bengali,Devanagari,Grantha,Gujarati,Gurmukhi,Kannada,Latin,Malayalam,Oriya,Sharada,Tamil,Telugu,Tirhuta),
 Bengali,Devanagari,Grantha,Gujarati,Gurmukhi,Kannada,Latin,Malayalam,Oriya,Tamil,Telugu,Tirhuta (Bengali,Devanagari,Grantha,Gujarati,Gurmukhi,Kannada,Latin,Malayalam,Oriya,Tamil,Telugu,Tirhuta),
 Bengali,Devanagari,Grantha,Kannada (Bengali,Devanagari,Grantha,Kannada),
 Bengali,Devanagari,Grantha,Kannada,Nandinagari,Oriya,Telugu,Tirhuta (Bengali,Devanagari,Grantha,Kannada,Nandinagari,Oriya,Telugu,Tirhuta),
 Bhaiksuki (Bhaiksuki),
 Bopomofo (Bopomofo),
 Bopomofo,Han (Bopomofo,Han),
 Bopomofo,Han,Hangul,Hiragana,Katakana (Bopomofo,Han,Hangul,Hiragana,Katakana),
 Bopomofo,Han,Hangul,Hiragana,Katakana,Yi (Bopomofo,Han,Hangul,Hiragana,Katakana,Yi),
 Brahmi (Brahmi),
 Braille (Braille),
 Buginese (Buginese),
 Buginese,Javanese (Buginese,Javanese),
 Buhid (Buhid),
 Buhid,Hanunoo,Tagalog,Tagbanwa (Buhid,Hanunoo,Tagalog,Tagbanwa),
 Canadian_Aboriginal (Canadian_Aboriginal),
 Carian (Carian),
 Caucasian_Albanian (Caucasian_Albanian),
 Chakma (Chakma),
 Chakma,Myanmar,Tai_Le (Chakma,Myanmar,Tai_Le),
 Cham (Cham),
 Cherokee (Cherokee),
 Common (Common),
 Coptic (Coptic),
 Cuneiform (Cuneiform),
 Cypriot (Cypriot),
 Cypriot,Linear_A,Linear_B (Cypriot,Linear_A,Linear_B),
 Cypriot,Linear_B (Cypriot,Linear_B),
 Cyrillic (Cyrillic),
 Cyrillic,Glagolitic (Cyrillic,Glagolitic),
 Cyrillic,Latin (Cyrillic,Latin),
 Cyrillic,Old_Permic (Cyrillic,Old_Permic),
 Deseret (Deseret),
 Devanagari (Devanagari),
 Devanagari,Dogra,Gujarati,Gurmukhi,Kaithi,Kannada,Khojki,Khudawadi,Mahajani,Malayalam,Modi,Nandinagari,Takri,Tirhuta (Devanagari,Dogra,Gujarati,Gurmukhi,Kaithi,Kannada,Khojki,Khudawadi,Mahajani,Malayalam,Modi,Nandinagari,Takri,Tirhuta),
 Devanagari,Dogra,Gujarati,Gurmukhi,Kaithi,Kannada,Khojki,Khudawadi,Mahajani,Modi,Nandinagari,Takri,Tirhuta (Devanagari,Dogra,Gujarati,Gurmukhi,Kaithi,Kannada,Khojki,Khudawadi,Mahajani,Modi,Nandinagari,Takri,Tirhuta),
 Devanagari,Dogra,Gujarati,Gurmukhi,Kaithi,Khojki,Khudawadi,Mahajani,Modi,Takri,Tirhuta (Devanagari,Dogra,Gujarati,Gurmukhi,Kaithi,Khojki,Khudawadi,Mahajani,Modi,Takri,Tirhuta),
 Devanagari,Dogra,Kaithi,Mahajani (Devanagari,Dogra,Kaithi,Mahajani),
 Devanagari,Grantha (Devanagari,Grantha),
 Devanagari,Grantha,Kannada (Devanagari,Grantha,Kannada),
 Devanagari,Grantha,Latin (Devanagari,Grantha,Latin),
 Devanagari,Kannada,Malayalam,Oriya,Tamil,Telugu (Devanagari,Kannada,Malayalam,Oriya,Tamil,Telugu),
 Devanagari,Nandinagari (Devanagari,Nandinagari),
 Devanagari,Sharada (Devanagari,Sharada),
 Devanagari,Tamil (Devanagari,Tamil),
 Dogra (Dogra),
 Duployan (Duployan),
 Egyptian_Hieroglyphs (Egyptian_Hieroglyphs),
 Elbasan (Elbasan),
 Elymaic (Elymaic),
 Ethiopic (Ethiopic),
 Georgian (Georgian),
 Georgian,Latin (Georgian,Latin),
 Glagolitic (Glagolitic),
 Gothic (Gothic),
 Grantha (Grantha),
 Grantha,Tamil (Grantha,Tamil),
 Greek (Greek),
 Gujarati (Gujarati),
 Gujarati,Khojki (Gujarati,Khojki),
 Gunjala_Gondi (Gunjala_Gondi),
 Gurmukhi (Gurmukhi),
 Gurmukhi,Multani (Gurmukhi,Multani),
 Han (Han),
 Han,Hiragana,Katakana (Han,Hiragana,Katakana),
 Hangul (Hangul),
 Hanifi_Rohingya (Hanifi_Rohingya),
 Hanunoo (Hanunoo),
 Hatran (Hatran),
 Hebrew (Hebrew),
 Hiragana (Hiragana),
 Hiragana,Katakana (Hiragana,Katakana),
 Imperial_Aramaic (Imperial_Aramaic),
 Inherited (Inherited),
 Inscriptional_Pahlavi (Inscriptional_Pahlavi),
 Inscriptional_Parthian (Inscriptional_Parthian),
 Javanese (Javanese),
 Kaithi (Kaithi),
 Kannada (Kannada),
 Kannada,Nandinagari (Kannada,Nandinagari),
 Katakana (Katakana),
 Kayah_Li (Kayah_Li),
 Kayah_Li,Latin,Myanmar (Kayah_Li,Latin,Myanmar),
 Kharoshthi (Kharoshthi),
 Khmer (Khmer),
 Khojki (Khojki),
 Khudawadi (Khudawadi),
 Lao (Lao),
 Latin (Latin),
 Latin,Mongolian (Latin,Mongolian),
 Lepcha (Lepcha),
 Limbu (Limbu),
 Linear_A (Linear_A),
 Linear_B (Linear_B),
 Lisu (Lisu),
 Lycian (Lycian),
 Lydian (Lydian),
 Mahajani (Mahajani),
 Makasar (Makasar),
 Malayalam (Malayalam),
 Mandaic (Mandaic),
 Manichaean (Manichaean),
 Marchen (Marchen),
 Masaram_Gondi (Masaram_Gondi),
 Medefaidrin (Medefaidrin),
 Meetei_Mayek (Meetei_Mayek),
 Mende_Kikakui (Mende_Kikakui),
 Meroitic_Cursive (Meroitic_Cursive),
 Meroitic_Hieroglyphs (Meroitic_Hieroglyphs),
 Miao (Miao),
 Modi (Modi),
 Mongolian (Mongolian),
 Mongolian,Phags_Pa (Mongolian,Phags_Pa),
 Mro (Mro),
 Multani (Multani),
 Myanmar (Myanmar),
 Nabataean (Nabataean),
 Nandinagari (Nandinagari),
 New_Tai_Lue (New_Tai_Lue),
 Newa (Newa),
 Nko (Nko),
 Nushu (Nushu),
 Nyiakeng_Puachue_Hmong (Nyiakeng_Puachue_Hmong),
 Ogham (Ogham),
 Ol_Chiki (Ol_Chiki),
 Old_Hungarian (Old_Hungarian),
 Old_Italic (Old_Italic),
 Old_North_Arabian (Old_North_Arabian),
 Old_Permic (Old_Permic),
 Old_Persian (Old_Persian),
 Old_Sogdian (Old_Sogdian),
 Old_South_Arabian (Old_South_Arabian),
 Old_Turkic (Old_Turkic),
 Oriya (Oriya),
 Osage (Osage),
 Osmanya (Osmanya),
 Pahawh_Hmong (Pahawh_Hmong),
 Palmyrene (Palmyrene),
 Pau_Cin_Hau (Pau_Cin_Hau),
 Phags_Pa (Phags_Pa),
 Phoenician (Phoenician),
 Psalter_Pahlavi (Psalter_Pahlavi),
 Rejang (Rejang),
 Runic (Runic),
 Samaritan (Samaritan),
 Saurashtra (Saurashtra),
 Sharada (Sharada),
 Shavian (Shavian),
 Siddham (Siddham),
 Sign_Writing (Sign_Writing),
 Sinhala (Sinhala),
 Sogdian (Sogdian),
 Sora_Sompeng (Sora_Sompeng),
 Soyombo (Soyombo),
 Sundanese (Sundanese),
 Syloti_Nagri (Syloti_Nagri),
 Syriac (Syriac),
 Tagalog (Tagalog),
 Tagbanwa (Tagbanwa),
 Tai_Le (Tai_Le),
 Tai_Tham (Tai_Tham),
 Tai_Viet (Tai_Viet),
 Takri (Takri),
 Tamil (Tamil),
 Tangut (Tangut),
 Telugu (Telugu),
 Thaana (Thaana),
 Thai (Thai),
 Tibetan (Tibetan),
 Tifinagh (Tifinagh),
 Tirhuta (Tirhuta),
 Ugaritic (Ugaritic),
 Unknown (Unknown),
 Vai (Vai),
 Wancho (Wancho),
 Warang_Citi (Warang_Citi),
 Yi (Yi),
 Zanabazar_Square (Zanabazar_Square)

We can very easily make a single enum value for each one and programmatically generate an intersect() function that calculates the intersection of any two of them. This would be faster than storing and intersecting a slice per code point.
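
As a rough illustration of the enum idea, here is a hand-written sketch covering only six of the sets listed above (the Arabic/Syriac/Thaana family). The `ScriptExtension` name, the variant names, and the `Option` return type are assumptions made for this example, not an existing API; in practice the `match` arms would be emitted by a generator script rather than typed by hand.

```rust
/// Hypothetical enum: one variant per distinct Script_Extensions set.
/// Only six of the sets from the list are included in this sketch.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum ScriptExtension {
    Arabic,
    Syriac,
    Thaana,
    ArabicSyriac,
    ArabicThaana,
    ArabicSyriacThaana,
}

impl ScriptExtension {
    /// Returns the intersection of two sets, or `None` when they
    /// have no script in common.
    pub fn intersect(self, other: Self) -> Option<Self> {
        use ScriptExtension::*;
        // A set intersected with itself is unchanged.
        if self == other {
            return Some(self);
        }
        match (self, other) {
            // The three-script set contains every other variant in this sketch.
            (ArabicSyriacThaana, x) | (x, ArabicSyriacThaana) => Some(x),
            // Two-script sets that overlap only in Arabic.
            (ArabicSyriac, ArabicThaana) | (ArabicThaana, ArabicSyriac) => Some(Arabic),
            // A single script intersected with a set that contains it.
            (Arabic, ArabicSyriac) | (ArabicSyriac, Arabic) => Some(Arabic),
            (Arabic, ArabicThaana) | (ArabicThaana, Arabic) => Some(Arabic),
            (Syriac, ArabicSyriac) | (ArabicSyriac, Syriac) => Some(Syriac),
            (Thaana, ArabicThaana) | (ArabicThaana, Thaana) => Some(Thaana),
            // Everything else is disjoint.
            _ => None,
        }
    }
}
```

With this sketch, `ScriptExtension::ArabicSyriac.intersect(ScriptExtension::ArabicThaana)` gives `Some(ScriptExtension::Arabic)`, and `ScriptExtension::Syriac.intersect(ScriptExtension::Thaana)` gives `None`. A generated version would also need to handle intersections whose result is not itself one of the listed sets, which does not come up in this small subset.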

(For performance, it would probably also be worth running these checks only on non-ASCII identifiers.)
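
The ASCII fast path could look something like the sketch below. It relies on the fact that ASCII contains only Latin letters and Common characters (digits, underscore, punctuation), so an all-ASCII identifier cannot mix scripts. The function names and the closure parameter are hypothetical and stand in for whatever the real mixed-script check ends up being.

```rust
/// Returns true when `ident` contains at least one non-ASCII character
/// and therefore needs the (more expensive) script checks.
pub fn needs_script_check(ident: &str) -> bool {
    !ident.is_ascii()
}

/// Runs `is_single_script` only for identifiers that are not pure ASCII.
/// `is_single_script` is a placeholder for the real mixed-script check.
pub fn identifier_is_ok<F>(ident: &str, is_single_script: F) -> bool
where
    F: FnOnce(&str) -> bool,
{
    if !needs_script_check(ident) {
        // Pure ASCII: only Latin and Common characters, nothing to detect.
        return true;
    }
    is_single_script(ident)
}
```

`str::is_ascii` is part of the standard library, so the common case of an ASCII identifier never touches the script tables.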
