Description
Because of the new string dtype, we also implicitly changed the representation of the unique categories in the Categorical dtype repr (aside from the object -> str change for the dtype):
>>> pd.options.future.infer_string = False
>>> pd.Categorical(list("abca"))
['a', 'b', 'c', 'a']
Categories (3, object): ['a', 'b', 'c']
>>> pd.options.future.infer_string = True
>>> pd.Categorical(list("abca"))
['a', 'b', 'c', 'a']
Categories (3, str): [a, b, c]
So the actual array values are always quoted, but the list of unique categories in the dtype repr goes from ['a', 'b', 'c'] to [a, b, c].
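To make the difference concrete, here is a minimal pure-Python sketch of the two styles. It assumes the quoted form corresponds to formatting each category with `repr()` and the unquoted form to formatting it with `str()`. The helper `format_categories` is hypothetical and written only for illustration; it is not pandas' actual formatting code.

```python
def format_categories(categories, quoted):
    """Render a list of categories in one of the two styles shown above.

    Hypothetical helper for illustration only, not pandas' formatter.
    """
    # Assumption: quoted output matches repr() of each element,
    # unquoted output matches str() of each element.
    fmt = repr if quoted else str
    return "[" + ", ".join(fmt(c) for c in categories) + "]"


cats = ["a", "b", "c"]
print(format_categories(cats, quoted=True))   # ['a', 'b', 'c']
print(format_categories(cats, quoted=False))  # [a, b, c]
```

Any doctest that captures the `Categories (...)` line verbatim is sensitive to exactly this difference.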
Brock already fixed a bunch of xfails in the tests because of this in #61727, and we also ran into this issue with the failing doctests (#61886).
@jbrockmendel mentioned there:
It isn't 100% obvious that the new repr for Categoricals is an improvement, but it's non-crazy.
I agree with that, and I also have no strong opinion either way.
But before we go and fix the doctests as well, let's confirm that we are OK with this change. If we don't have a strong opinion that it is an improvement, we could also leave the repr as it was originally, and avoid some breakage for downstream projects or users (e.g. those who also have doctests).