As discussed in #19527, fitting models on data with a constant feature can be surprising. For instance, a `StandardScaler(with_mean=False)` fit on a column with constant values set to 1000 will let those values pass through unchanged, because the variance of the column is zero. It can be surprising, but is this a problem? Should we warn the user about the presence of such constant features, which are typically not predictive for machine learning models?
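A minimal reproduction of the pass-through behavior (assuming a recent scikit-learn, where zero-variance features get their `scale_` clamped to 1):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# A single feature whose values are all 1000: its variance is zero.
X = np.full((5, 1), 1000.0)

scaler = StandardScaler(with_mean=False)
X_scaled = scaler.fit_transform(X)

# The zero-variance column has scale_ clamped to 1, so the constant
# values pass through unchanged.
print(scaler.scale_)      # [1.]
print(X_scaled.ravel())   # [1000. 1000. 1000. 1000. 1000.]
```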
Which estimators should warn about such constant features? The scalers can naturally detect them when computing the `scale_` attribute. `QuantileTransformer` could also probably warn about this degenerate case. `HistGradientBoosting*` and `KBinsDiscretizer` can also do it efficiently when binning the feature values.
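For the scalers, such a check is essentially free once the fit statistics are computed. A minimal sketch of a post-fit detection (the warning wording is hypothetical, not a proposed final message):

```python
import warnings

import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([[1000.0, 1.0],
              [1000.0, 2.0],
              [1000.0, 3.0]])

scaler = StandardScaler(with_mean=False).fit(X)

# Constant columns are exactly the zero-variance ones, which the scaler
# already identified while computing var_ / scale_.
constant_idx = np.flatnonzero(scaler.var_ == 0)
if constant_idx.size:
    warnings.warn(
        f"Features {constant_idx.tolist()} are constant and will be "
        "passed through unchanged."  # hypothetical message
    )
```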
If we do so:
- What should be the warning message? Should it be the same for all the models?
- Shall we add a standard constructor param to these estimators, `constant_feature={'warn', 'drop', 'passthrough', 'zero', 'one'}`, with `"warn"` as the default? (See the sketch after this list.)
- Should we generalize this to all estimators? (ogrisel: probably not, because it would add an expensive and redundant input validation check, so we could restrict it to the estimators above where it is cheap to check.)
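To make the second option concrete, here is a hypothetical sketch of the `constant_feature` semantics written as a standalone transformer; the class name and exact behavior are illustrative only, not an agreed design:

```python
import warnings

import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin


class ConstantFeatureHandler(TransformerMixin, BaseEstimator):
    """Illustrative implementation of the proposed constant_feature
    options: 'warn', 'drop', 'passthrough', 'zero', 'one'."""

    def __init__(self, constant_feature="warn"):
        self.constant_feature = constant_feature

    def fit(self, X, y=None):
        X = np.asarray(X, dtype=float)
        # Detect zero-variance (constant) columns once, at fit time.
        self.constant_mask_ = X.var(axis=0) == 0
        if self.constant_feature == "warn" and self.constant_mask_.any():
            warnings.warn(
                f"Features {np.flatnonzero(self.constant_mask_).tolist()} "
                "are constant."
            )
        return self

    def transform(self, X):
        X = np.asarray(X, dtype=float)
        if self.constant_feature == "drop":
            return X[:, ~self.constant_mask_]
        if self.constant_feature in ("zero", "one"):
            X = X.copy()
            X[:, self.constant_mask_] = (
                0.0 if self.constant_feature == "zero" else 1.0
            )
        # 'warn' and 'passthrough' leave the values unchanged.
        return X
```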
Are there legitimate cases where such a warning would be frequent and annoying? For instance, `StandardScaler(with_mean=False)` after a `OneHotEncoder` with dense output, when a categorical feature has one category that is significantly more frequent than the others, inside a cross-validation loop? A similar problem could happen after an `OrdinalEncoder`. But would `StandardScaler(with_mean=False)` actually make sense to use in those cases?
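A sketch of how this could happen: in a small training split of a cross-validation loop, a rare category may be absent, leaving constant one-hot columns (assumes scikit-learn >= 1.2 for the `sparse_output` parameter):

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Hypothetical CV training fold where only the frequent category "a"
# appears; the rare category "b" is absent from this split.
X_fold = np.array([["a"], ["a"], ["a"], ["a"]])

enc = OneHotEncoder(
    categories=[["a", "b"]], sparse_output=False, handle_unknown="ignore"
)
Xt = enc.fit_transform(X_fold)
print(Xt)
# [[1. 0.]
#  [1. 0.]
#  [1. 0.]
#  [1. 0.]]  <- both columns are constant in this fold
```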
List of estimators to consider:
- scalers (such as `StandardScaler`, `RobustScaler`, `MinMaxScaler`, ...),
- estimators that do feature binning: `HistGradientBoosting*` and `KBinsDiscretizer`,
- feature selectors such as `SelectKBest`.
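For reference, dropping constant features is already possible today with `VarianceThreshold`, whose default threshold removes exactly the zero-variance columns; a `'drop'` option could mirror this behavior:

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold

X = np.array([[1000.0, 1.0],
              [1000.0, 2.0],
              [1000.0, 3.0]])

# The default threshold of 0.0 removes only zero-variance features.
print(VarianceThreshold().fit_transform(X))
# [[1.]
#  [2.]
#  [3.]]
```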