[RFC] Should scalers or other estimators warn when fit on constant features? #19547

@ogrisel

As discussed in #19527, fitting models on data with constant features can be surprising.

For instance, a StandardScaler(with_mean=False) fit on a column whose values are all 1000. will let those values pass through unchanged because the variance of the column is zero. This can be surprising, but is it a problem? Should we warn the user about the presence of such constant features, which are typically not predictive for machine learning models?
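A minimal sketch of the behavior described above, assuming a recent scikit-learn and NumPy are installed. The toy array is made up for illustration; only the first column is constant.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy data (illustrative only): the first feature is constant at 1000.,
# the second one varies.
X = np.array([[1000.0, 1.0],
              [1000.0, 2.0],
              [1000.0, 3.0]])

scaler = StandardScaler(with_mean=False).fit(X)
X_scaled = scaler.transform(X)

# The constant column has zero variance, so its scale falls back to 1
# and its values come out of transform() unchanged.
print(scaler.scale_)
print(X_scaled[:, 0])
```

No warning is emitted here, which is the silent passthrough this issue is asking about.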

Which estimators should warn about such constant features? The scalers can naturally detect them when computing the scale_ attribute. The QuantileTransformer could probably also warn about this degenerate case.

HistGradientBoosting* and KBinsDiscretizer can also do it efficiently when binning the feature values.

If we do so:

  • what should be the warning message? Should it be the same for all the models?
  • shall we add a standard constructor param to these estimators, constant_feature={'warn', 'drop', 'passthrough', 'zero', 'one'}, with "warn" as the default?
  • should we generalize this to all estimators? (ogrisel: probably not, because it could be an expensive and redundant input validation check, so we could restrict it to the estimators above where it's cheap to check)
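To make the options in the list above concrete, here is a rough sketch of what such a policy could look like as a standalone function. Everything in it is an assumption for discussion: `apply_constant_feature_policy` is a hypothetical helper, not a scikit-learn API, the warning text is a placeholder, and the meaning given to each option ("warn" leaves the data unchanged, "zero" and "one" overwrite the constant columns) is only one possible reading of the proposal.

```python
import warnings

import numpy as np

# Option names taken from the proposal; their semantics below are assumed.
_POLICIES = ("warn", "drop", "passthrough", "zero", "one")


def apply_constant_feature_policy(X, constant_feature="warn"):
    """Hypothetical helper (not part of scikit-learn)."""
    if constant_feature not in _POLICIES:
        raise ValueError(f"Unknown constant_feature={constant_feature!r}")
    X = np.asarray(X, dtype=float)
    # Cheap detection: a column is constant when its min equals its max.
    # Columns containing NaN are not flagged by this comparison.
    is_constant = X.min(axis=0) == X.max(axis=0)
    if not is_constant.any() or constant_feature == "passthrough":
        return X
    if constant_feature == "warn":
        idx = np.flatnonzero(is_constant).tolist()
        warnings.warn(f"Constant features found at column indices {idx}.",
                      UserWarning)
        return X
    if constant_feature == "drop":
        return X[:, ~is_constant]
    # "zero" or "one": overwrite the constant columns with a fixed value.
    X = X.copy()
    X[:, is_constant] = 0.0 if constant_feature == "zero" else 1.0
    return X


X = np.array([[1000.0, 1.0], [1000.0, 2.0]])
print(apply_constant_feature_policy(X, "drop").shape)
```

In a real estimator the detection would presumably reuse statistics already computed during fit (for example scale_ or bin edges) rather than a separate pass over the data.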

Are there legitimate cases where such a warning would be frequent and annoying? For instance, StandardScaler(with_mean=False) after OneHotEncoding with dense output, on a categorical feature that has one category significantly more frequent than the others, in a cross-validation loop? A similar problem could happen after OrdinalEncoding. But would StandardScaler(with_mean=False) actually make sense in those cases?

List of estimators to consider:

  • scalers (such as StandardScaler, RobustScaler, MinMaxScaler, ...),
  • estimators that do feature binning: HistGradientBoosting* and KBinsDiscretizer,
  • feature selectors such as SelectKBest.
