Inconsistent handling of non-ASCII characters in encodings.normalize_encoding() #136736

@serhiy-storchaka

Bug report

#83518 changed the handling of non-ASCII characters in encodings.normalize_encoding(), but the result is still inconsistent with codecs.lookup() and is not even self-consistent. For example:

>>> import encodings
>>> encodings.normalize_encoding('a¤b')
'a_b'
>>> encodings.normalize_encoding('aæb')
'ab'
>>> encodings.normalize_encoding('a-¤')
'a'
>>> encodings.normalize_encoding('a-æ')
'a_'
>>> encodings.normalize_encoding('a-¤-b')
'a_b'
>>> encodings.normalize_encoding('a-æ-b')
'a__b'

The result can even end with an underscore or contain repeated underscores in the middle.
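
To make it easier to scan many names for these two symptoms, here is a minimal sketch of a checker. The helper names `has_underscore_artifacts` and `find_artifacts` are hypothetical and not part of the stdlib. The sketch only flags a trailing underscore or a doubled underscore in the normalized name. It makes no claim about what the correct normalization should be, and its output depends on the Python version in use.

```python
import encodings


def has_underscore_artifacts(normalized):
    """Return True if the name ends with '_' or contains '__'."""
    return normalized.endswith("_") or "__" in normalized


def find_artifacts(names, normalize=encodings.normalize_encoding):
    """Return (name, normalized) pairs whose normalized form looks malformed.

    `normalize` is injectable so that alternative implementations
    can be compared against the stdlib function.
    """
    results = []
    for name in names:
        normalized = normalize(name)
        if has_underscore_artifacts(normalized):
            results.append((name, normalized))
    return results


if __name__ == "__main__":
    # Inputs taken from the report above; results vary by Python version.
    samples = ["a¤b", "aæb", "a-¤", "a-æ", "a-¤-b", "a-æ-b"]
    for name, normalized in find_artifacts(samples):
        # !a escapes non-ASCII so printing works on any console encoding.
        print(f"{name!a} -> {normalized!a}")
```

Passing a candidate replacement as `normalize` would show whether a proposed fix removes these artifacts for the same inputs.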

cc @malemburg, @vstinner, @shihai1991

Labels: 3.13, 3.14, 3.15, stdlib, topic-unicode, type-bug
