Description
Feature Type
- Adding new functionality to pandas
- Changing existing functionality in pandas
- Removing existing functionality in pandas
Problem Description
Our proposal improves the robustness of pandas' text importers, in particular the read_csv() function. Currently, an explicit encoding can be set; otherwise it defaults to None, which seems to resolve to 'utf-8', although this may be platform-specific. Unfortunately, CSV files often come in different encodings. For example, Excel does not use UTF-8 by default, and users often do not pay attention to encodings when saving, so we have to handle different file encodings. pandas raises a UnicodeDecodeError if an encoding other than 'utf-8' is required, even though text editors automatically detect the right encoding.
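The failure described above can be reproduced without pandas. The following is a minimal sketch using only the Python standard library; the sample text is made up for illustration, and it assumes a file that was saved as ISO-8859-1 and then read as UTF-8:

```python
# Made-up sample content, encoded the way a tool saving as ISO-8859-1 would write it.
text = "name;city\nJürgen;Köln\n"
raw = text.encode("iso-8859-1")

# Decoding these bytes as UTF-8 fails, because the bytes used for
# "ü" and "ö" in ISO-8859-1 are not valid UTF-8 sequences.
try:
    raw.decode("utf-8")
    decoded_as_utf8 = True
except UnicodeDecodeError as exc:
    decoded_as_utf8 = False
    error_message = str(exc)

# Decoding with the encoding the bytes were written in succeeds.
roundtrip = raw.decode("iso-8859-1")
```

The same kind of error surfaces from read_csv() when the file's actual encoding does not match the one used for decoding.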
Feature Description
Several resources suggest detecting the right encoding automatically using chardet.detect(). With this approach, the following code recognized the right encoding ('utf-8' or 'ISO-8859-1') in our experiments:
import io

import chardet
import pandas as pd

filename = 'path/to/some/file.csv'  # source file
encoding = None  # encoding can be predefined or not

# Read the raw bytes once so they can be inspected and then parsed.
with open(filename, 'rb') as file:
    data = file.read()

if encoding is None:  # if not explicitly given, detect the encoding from the raw bytes
    encoding = chardet.detect(data)['encoding']

df = pd.read_csv(io.BytesIO(data), encoding=encoding)
This could be offered as an additional encoding='auto' case, or even replace the current default in the None case, directly inside pandas. We do not know whether this automatic detection might fail in some cases, but in our experiments it did a much better job than the current default decoding. Therefore, we would like to propose this feature.
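To illustrate how an 'auto' case could be resolved before parsing, here is a rough sketch. The helper name resolve_encoding, its candidates parameter, and the 'auto' sentinel are all hypothetical and not part of pandas. To stay dependency-free it uses simple trial decoding rather than chardet's statistical detection; chardet.detect(data)['encoding'] could be swapped in for the loop.

```python
def resolve_encoding(data, encoding=None, candidates=("utf-8", "iso-8859-1")):
    """Return an encoding name for the raw bytes in `data`.

    Hypothetical helper: 'auto' is the sentinel proposed in this issue,
    not an existing pandas option.
    """
    if encoding != "auto":
        # Explicit encoding (or None): pass it through unchanged.
        return encoding
    # Trial decoding: return the first candidate that decodes without error.
    # Order matters, since ISO-8859-1 accepts any byte sequence.
    for name in candidates:
        try:
            data.decode(name)
        except UnicodeDecodeError:
            continue
        return name
    raise ValueError("none of the candidate encodings could decode the data")
```

The resolved name could then be passed on, for example as pd.read_csv(io.BytesIO(data), encoding=resolve_encoding(data, 'auto')). Note that trial decoding only proves that the bytes are decodable, not that the result is the text the author intended.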
Alternative Solutions
Without this feature, the right encoding has to be specified explicitly to avoid a UnicodeDecodeError.
Additional Context
No response