ENH: Auto-detect text encoding to avoid UnicodeDecodeErrors #55197

@CSBVision

Description

Feature Type

  • Adding new functionality to pandas

  • Changing existing functionality in pandas

  • Removing existing functionality in pandas

Problem Description

This proposal aims to make pandas' text importers more robust, in particular the read_csv() function. Currently, an explicit encoding can be set; otherwise it defaults to None, which seems to resolve to 'utf-8', although this may be platform-specific. Unfortunately, CSV files often come in other encodings. For example, Excel does not use UTF-8 by default, and users often do not pay attention to the encoding when saving, so we have to handle files in different encodings. pandas raises a UnicodeDecodeError if anything other than 'utf-8' is required, even though text editors detect the right encoding automatically.
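The underlying failure can be reproduced without pandas, because it comes from decoding the file's bytes. Below is a minimal sketch, assuming a file that was saved as ISO-8859-1; the sample text and the try_decode helper are made up for illustration:

```python
# Sample row with non-ASCII characters ("Müller;Zürich"), written with
# escapes so the snippet does not depend on the source file's own encoding.
text = "M\u00fcller;Z\u00fcrich\n"

# Bytes as they would be stored on disk by a tool that saves ISO-8859-1.
raw = text.encode("iso-8859-1")


def try_decode(data: bytes, encoding: str):
    """Return the decoded text, or None if the bytes are invalid in `encoding`."""
    try:
        return data.decode(encoding)
    except UnicodeDecodeError:
        return None


utf8_result = try_decode(raw, "utf-8")         # byte 0xFC is not valid UTF-8
latin1_result = try_decode(raw, "iso-8859-1")  # decodes back to the original text
```

Here utf8_result is None because the decode attempt raises a UnicodeDecodeError, while the ISO-8859-1 decode returns the original text.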

Feature Description

Several resources suggest detecting the encoding automatically with chardet.detect(). In our experiments, the following code recognized the right encoding ('utf-8' or 'ISO-8859-1'):

import io

import chardet
import pandas as pd

filename = 'path/to/some/file.csv'     # source file
encoding = None                        # encoding can be predefined or not
with open(filename, 'rb') as file:
    data = file.read()                 # read the raw bytes once
if encoding is None:                   # if not explicitly given, detect the encoding from the bytes
    encoding = chardet.detect(data)['encoding']
df = pd.read_csv(io.BytesIO(data), encoding=encoding)

This could be offered as an additional encoding='auto' case, or even replace the current default in the None case, directly inside pandas. We do not know whether auto-detection might fail in some cases, but in our experiments it handled files that the current default decoding could not. We would therefore like to propose this feature.
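Since detection may fail in some cases, a simpler strategy is worth having for comparison. The sketch below does not use chardet and is not an existing pandas API. It tries a fixed list of candidate encodings in order and returns the first one that decodes the bytes without error. The name guess_encoding and the candidate list are assumptions chosen to match the two encodings seen in our experiments:

```python
def guess_encoding(data: bytes, candidates=("utf-8-sig", "iso-8859-1")) -> str:
    """Return the first candidate encoding that decodes `data` without error.

    Hypothetical helper, not part of pandas or chardet. "utf-8-sig" accepts
    UTF-8 with or without a byte order mark. "iso-8859-1" maps every possible
    byte to a character, so it never fails and must come last.
    """
    for name in candidates:
        try:
            data.decode(name)
        except UnicodeDecodeError:
            continue  # not valid in this encoding, try the next candidate
        return name
    raise ValueError("none of the candidate encodings could decode the data")
```

The result could be passed on as pd.read_csv(io.BytesIO(data), encoding=guess_encoding(data)). One limitation is that the ISO-8859-1 fallback accepts any byte sequence, so a file in a different single-byte encoding would load without an error but possibly with wrong characters. That is the kind of case where statistical detection such as chardet may do better.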

Alternative Solutions

Otherwise, the correct encoding has to be specified explicitly for each file to avoid UnicodeDecodeErrors.

Additional Context

No response
