Skip to content

RFC ColumnTransformer input validation and requirements #14251

@amueller

Description

@amueller

There have been some issues around ColumnTransformer input requirements that I think we might want to discuss more explicitly. Examples are an actual bug when changing columns: #14237 and how to define number of input features #13603.
Related is also the idea of checking for column name consistency: #7242

Mainly I'm asking whether it makes sense to have the same requirements for ColumnTransformer as for other estimators.

ColumnTransformer is the only estimator that addresses columns by name, and so reordering columns doesn't impact the model and actually everything works fine. Should we still aim for checking for reordering of columns?

Right now, transform even works if we add additional columns to X, which makes sense, since it only cares about the columns it is using.

Should we allow that going forward as long as remainder is not used? (if remainder is used, the result of ColumnTransformer would have a different shape and downstream estimators would break, so I think we shouldn't support it in this case).

In either case, what's n_features_in_ for a ColumnTransformer? The number of columns in X during fit or the number of columns actually used?
Does it even make sense to define it if we allow adding other columns that are ignored?

Metadata

Metadata

Assignees

No one assigned

    Labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions