ENH: Auto-detect text encoding to avoid UnicodeDecodeErrors

### Feature Type

- [X] Adding new functionality to pandas

- [ ] Changing existing functionality in pandas

- [ ] Removing existing functionality in pandas


### Problem Description

This proposal aims to make pandas' text importers more robust, in particular the `read_csv()` function. Currently, an explicit encoding can be passed; otherwise `encoding` defaults to `None`, which seems to resolve to `'utf-8'` (this may be platform-specific). Unfortunately, CSV files often come in other encodings. For example, Excel does not use UTF-8 by default, and users often do not pay attention to the encoding when saving, so we have to handle files in several different encodings. pandas raises a `UnicodeDecodeError` whenever something other than `'utf-8'` is required, even though text editors detect the right encoding automatically.
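
The failure can be reproduced without pandas. The following is a minimal sketch using only Python's built-in codecs; the sample text is made up for illustration and stands in for a file saved in ISO-8859-1:

```python
# Made-up sample content, encoded the way a legacy export might store it.
raw = "name;city\nJürgen;Köln\n".encode("iso-8859-1")

error = None
try:
    raw.decode("utf-8")  # the same decoding step fails inside a UTF-8 reader
except UnicodeDecodeError as exc:
    error = exc
    print(f"utf-8 failed: {exc.reason} at byte {exc.start}")

# The same bytes decode cleanly once the right encoding is named.
text = raw.decode("iso-8859-1")
print(text)
```

The bytes themselves are fine; only the assumed encoding is wrong, which is why naming or detecting the encoding resolves the error.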

### Feature Description

Several resources suggest detecting the encoding automatically with `chardet.detect()`. In our experiments, the following code correctly identified the encoding (`'utf-8'` or `'ISO-8859-1'`):

```
import io

import chardet
import pandas as pd

filename = 'path/to/some/file.csv'     # source file
encoding = None                        # may be set explicitly; None means "detect"
with open(filename, 'rb') as file:     # read raw bytes so nothing is decoded yet
    data = file.read()
if encoding is None:                   # not given explicitly: let chardet guess
    encoding = chardet.detect(data)['encoding']
df = pd.read_csv(io.BytesIO(data), encoding=encoding)
```
This could be offered inside pandas as an additional `encoding='auto'` option, or it could even replace the current default behavior for `encoding=None`. We do not know whether auto-detection might fail in some cases, but in our experiments it did a much better job than the current default decoding. We would therefore like to propose this feature.
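
To make the `encoding='auto'` idea more concrete, here is a rough sketch of how the encoding argument might be resolved before parsing. The helper name `resolve_encoding`, its `detect` and `default` parameters, and the fallback to `'utf-8'` are assumptions made for illustration; none of this is existing pandas API.

```python
from typing import Callable, Optional


def resolve_encoding(
    data: bytes,
    encoding: Optional[str] = "auto",
    detect: Optional[Callable[[bytes], dict]] = None,
    default: str = "utf-8",
) -> Optional[str]:
    """Return the encoding name to use for `data`.

    Hypothetical helper: only the sentinel value "auto" triggers detection;
    any other value (including None) is returned unchanged.
    """
    if encoding != "auto":
        return encoding
    if detect is None:
        # chardet is a third-party package; import it only when it is needed.
        import chardet

        detect = chardet.detect
    guess = detect(data).get("encoding")
    # The detector may report None when it cannot decide; use the default then.
    return guess or default
```

With the raw bytes already in memory, the result could then be passed on as `pd.read_csv(io.BytesIO(data), encoding=resolve_encoding(data))`. Keeping the detector as a parameter would also allow a different detection library to be swapped in.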



### Alternative Solutions

Alternatively, the correct encoding has to be specified explicitly to avoid a `UnicodeDecodeError`.
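
When the set of possible encodings is known in advance, a variation on the explicit approach is to try a short list of candidates in order. This is a different technique from statistical detection and is only a sketch; the function name `decode_with_fallback` and the candidate list are assumptions for illustration.

```python
from typing import Sequence, Tuple


def decode_with_fallback(
    data: bytes,
    candidates: Sequence[str] = ("utf-8", "iso-8859-1"),
) -> Tuple[str, str]:
    """Decode `data` with the first candidate encoding that succeeds.

    Returns the decoded text together with the encoding name that worked.
    """
    for name in candidates:
        try:
            return data.decode(name), name
        except UnicodeDecodeError:
            continue  # try the next candidate
    raise ValueError(f"none of the encodings {list(candidates)} could decode the data")
```

Order matters here: ISO-8859-1 assigns a character to every byte value, so it never raises and should come last. A successful decode also does not guarantee that the chosen encoding is the one the file was written in.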

### Additional Context

_No response_

ENH: Auto-detect text encoding to avoid UnicodeDecodeErrors #55197
