RFC ColumnTransformer input validation and requirements

There have been some issues around ColumnTransformer input requirements that I think we might want to discuss more explicitly. Examples are an actual bug when changing columns: #14237 and how to define number of input features #13603.
Related is also the idea of checking for column name consistency: #7242

Mainly I'm asking whether it makes sense to have the same requirements for ColumnTransformer as for other estimators.

ColumnTransformer is the only estimator that addresses columns by name, and so reordering columns doesn't impact the model and actually everything works fine. Should we still aim for checking for reordering of columns?

Right now, ``transform`` even works if we add additional columns to ``X``, which makes sense, since it only cares about the columns it is using.

Should we allow that going forward as long as remainder is not used? (if remainder is used, the result of ``ColumnTransformer`` would have a different shape and downstream estimators would break, so I think we shouldn't support it in this case).

In either case, what's ``n_features_in_`` for a ColumnTransformer? The number of columns in ``X`` during ``fit`` or the number of columns actually used?
Does it even make sense to define it if we allow adding other columns that are ignored?

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Uh oh!

RFC ColumnTransformer input validation and requirements #14251

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Uh oh!

RFC ColumnTransformer input validation and requirements #14251

Description

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions