Inconsistent behaviour of GroupBy for BooleanArray series  #58031

@ziviland

Suppose the aggregate function returns an int or a float. If every group's result is 0 or 1, the output is converted to a BooleanArray. Otherwise, the output is an int or float array (as expected).

This happens because the following check preserves the original dtype whenever the series values are not an instance of np.ndarray, and a BooleanArray is not one.

if not isinstance(obj._values, np.ndarray):

So the code then tries to cast the result back to the original dtype whenever it can.
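To illustrate why only 0/1 results are affected, here is a minimal sketch of the cast itself. It assumes the preservation step behaves like a plain conversion to the "boolean" dtype; the helper name `can_cast_to_boolean` is hypothetical and is not part of pandas or of the groupby internals.

```python
import numpy as np
import pandas as pd


def can_cast_to_boolean(values):
    """Return True if `values` converts to the nullable "boolean" dtype.

    Hypothetical helper for illustration only; it mimics the idea of
    "preserve the dtype if possible", not the actual groupby code path.
    """
    try:
        pd.array(np.asarray(values, dtype="float64"), dtype="boolean")
    except (TypeError, ValueError):
        # pandas rejects values that are not bool-like (e.g. 0.5).
        return False
    return True


# Aggregated means that are all 0 or 1 can be cast back to "boolean".
print(can_cast_to_boolean([1.0, 0.0]))
# A mean such as 0.5 cannot, so the float result would be kept.
print(can_cast_to_boolean([1.0, 0.5]))
```

If this assumption holds, the output dtype depends on the aggregated values rather than on the aggregation function, which matches the behaviour described in this report.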

Code to reproduce

import pandas as pd

df = pd.DataFrame({0: [1, 2, 2], 1: [True, False, None]})
df[1] = df[1].astype("boolean")
print(df.groupby(by=0).aggregate(lambda s: s.fillna(False).mean()).dtypes.values[0])

prints boolean.

If we change the values in the array:

df = pd.DataFrame({0: [1, 2, 2], 1: [True, True, None]})
df[1] = df[1].astype("boolean")
print(df.groupby(by=0).aggregate(lambda s: s.fillna(False).mean()).dtypes.values[0])

then it prints float64.

If the dtype is "bool" (not "boolean"), groupby always returns the expected float result.

df = pd.DataFrame({0: [1, 2, 2], 1: [True, False, None]})
df[1] = df[1].astype("bool")
print(df.groupby(by=0).aggregate(lambda s: s.fillna(False).mean()).dtypes.values[0])

prints float64.
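For convenience, the three cases above can be run side by side. This is only a sketch: the helper name `aggregated_dtype` is hypothetical, and the printed dtypes may differ between pandas versions, so no expected output is given for the "boolean" cases.

```python
import pandas as pd


def aggregated_dtype(values, dtype):
    """Return the dtype of column 1 after the groupby mean from this report.

    Hypothetical helper; it only wraps the reproduction steps above.
    """
    df = pd.DataFrame({0: [1, 2, 2], 1: values})
    df[1] = df[1].astype(dtype)
    result = df.groupby(by=0).aggregate(lambda s: s.fillna(False).mean())
    return result[1].dtype


cases = [
    ([True, False, None], "boolean"),  # group means are 1.0 and 0.0
    ([True, True, None], "boolean"),   # group means are 1.0 and 0.5
    ([True, False, None], "bool"),     # NumPy-backed bool column
]
for values, dtype in cases:
    print(dtype, values, "->", aggregated_dtype(values, dtype))
```

Comparing the first and second lines of output shows whether the installed pandas version still lets the result dtype depend on the data.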
