Open
Labels
Groupby; NA - MaskedArrays (related to pd.NA and nullable extension arrays); Reduction Operations (sum, mean, min, max, etc.)
Description
Suppose the aggregation function returns int or float values. If the results for a "boolean" column consist only of 0 and 1, the result is converted to a BooleanArray. Otherwise an int or float array is returned, as expected.
This happens because the code below preserves the original dtype whenever the series values are not an instance of np.ndarray, and a BooleanArray is not one.
pandas/pandas/core/groupby/ops.py
Line 917 in b552dc9
if not isinstance(obj._values, np.ndarray):
For such values, the code then tries to cast the result back to the original dtype whenever it can.
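The value-dependent outcome can be illustrated with a minimal sketch. The helper `try_preserve_boolean` is hypothetical and is not the pandas implementation; it only assumes that "preserving the type if it can" means casting back to "boolean" when every result is exactly 0 or 1. The two input arrays are the per-group means produced by the first two reproduction cases in this issue.

```python
import numpy as np
import pandas as pd


def try_preserve_boolean(result):
    """Hypothetical helper (not pandas internals): cast aggregated
    results back to "boolean" only when the cast is lossless."""
    values = np.asarray(result, dtype="float64")
    if np.isin(values, [0.0, 1.0]).all():
        # Every value is 0 or 1, so the cast back "succeeds".
        return pd.array(values.astype(bool), dtype="boolean")
    # Otherwise the float result is kept as-is.
    return values


# Group means for column values [True, False, None] after fillna(False)
first = try_preserve_boolean(np.array([1.0, 0.0]))
# Group means for column values [True, True, None] after fillna(False)
second = try_preserve_boolean(np.array([1.0, 0.5]))

print(first.dtype)   # boolean
print(second.dtype)  # float64
```

Under this assumption, the result dtype depends on the data rather than on the aggregation function, which matches the behavior reported here.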
Code to reproduce
import pandas as pd

df = pd.DataFrame({0: [1, 2, 2], 1: [True, False, None]})
df[1] = df[1].astype("boolean")
print(df.groupby(by=0).aggregate(lambda s: s.fillna(False).mean()).dtypes.values[0])
prints boolean.
If we change the values in the column:
df = pd.DataFrame({0: [1, 2, 2], 1: [True, True, None]})
df[1] = df[1].astype("boolean")
print(df.groupby(by=0).aggregate(lambda s: s.fillna(False).mean()).dtypes.values[0])
then it prints float64.
If the dtype is "bool" (not "boolean"), then groupby always returns the expected float result:
df = pd.DataFrame({0: [1, 2, 2], 1: [True, False, None]})
df[1] = df[1].astype("bool")
print(df.groupby(by=0).aggregate(lambda s: s.fillna(False).mean()).dtypes.values[0])
prints float64.
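Based on the observation above, one possible workaround (a sketch, not an official recommendation) is to fill the missing values and convert the column to NumPy "bool" before grouping, so that the values are a plain np.ndarray and the dtype-preserving branch is not taken. The names `plain` and `result` are introduced here for illustration only.

```python
import pandas as pd

df = pd.DataFrame({0: [1, 2, 2], 1: [True, False, None]})
df[1] = df[1].astype("boolean")

# Fill missing values first, then convert to NumPy bool before grouping,
# so the grouped values are backed by np.ndarray rather than BooleanArray.
plain = df.copy()
plain[1] = plain[1].fillna(False).astype("bool")

result = plain.groupby(by=0).aggregate(lambda s: s.mean())
print(result.dtypes.values[0])
```

The trade-off is that the nullable dtype is lost before aggregation, so this only fits cases where missing values can be filled up front, as in the reproduction above.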