
Output formatting: the repr of the Categorical categories (quoted or unquoted strings?) #61890

@jorisvandenbossche


Because of the new string dtype, we also implicitly changed the representation of the unique categories in the Categorical dtype repr (aside from the object -> str change for the dtype):

>>> pd.options.future.infer_string = False
>>> pd.Categorical(list("abca"))
['a', 'b', 'c', 'a']
Categories (3, object): ['a', 'b', 'c']
>>> pd.options.future.infer_string = True
>>> pd.Categorical(list("abca"))
['a', 'b', 'c', 'a']
Categories (3, str): [a, b, c]

So the actual array values are always quoted, but the list of unique categories in the dtype repr goes from ['a', 'b', 'c'] to [a, b, c].
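To check which form a given pandas installation produces, the categories line can be pulled out of the repr directly. This is only a minimal sketch: `categories_footer` is a hypothetical helper name, and it assumes the categories summary is the last line of the Categorical repr, as in the output above.

```python
import pandas as pd


def categories_footer(cat: pd.Categorical) -> str:
    # Hypothetical helper: assumes the "Categories (...)" summary is the
    # last line of the repr, as in the examples above.
    return repr(cat).splitlines()[-1]


cat = pd.Categorical(list("abca"))
footer = categories_footer(cat)
# True if the categories are shown quoted, False if shown as bare a, b, c.
quoted = "'a'" in footer
print(footer, quoted)
```

Running this with `pd.options.future.infer_string` set to `False` and then `True` should show whether the quoting differs between the two modes on the installed version.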

Brock already fixed a bunch of xfails in the tests because of this in #61727. And we also run into this issue for the failing doctests (#61886).

@jbrockmendel mentioned there:

It isn't 100% obvious that the new repr for Categoricals is an improvement, but it's non-crazy.

I agree with that, and I also have no strong opinion either way.

But before we also go fixing doctests, let's confirm that we are OK with this change. If we don't have a strong opinion that it is an improvement, we could also leave the repr as it was originally, which would avoid some breakage for downstream projects or users (e.g. those who also have doctests).
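Independently of what is decided here, downstream tests can sidestep the question by asserting on values instead of on printed output. A minimal sketch, assuming only the public `Categorical.categories` and `Categorical.codes` attributes; `category_values` is a hypothetical helper name:

```python
import pandas as pd


def category_values(cat: pd.Categorical) -> list:
    # Hypothetical helper: return the categories as a plain Python list,
    # so the check does not depend on how the repr quotes them.
    return list(cat.categories)


cat = pd.Categorical(list("abca"))
values = category_values(cat)
codes = cat.codes.tolist()

# These comparisons do not involve the repr, so they should behave the
# same whether the categories have object or str dtype.
assert values == ["a", "b", "c"]
assert codes == [0, 1, 2, 0]
```

This does not help doctests that intentionally show the repr, which is the case the question above is about.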


Labels: Categorical (Categorical Data Type), Needs Discussion (Requires discussion from core team before further action), Output-Formatting (__repr__ of pandas objects, to_string)
