-
-
Notifications
You must be signed in to change notification settings - Fork 26.1k
Description
There have been some issues around ColumnTransformer input requirements that I think we might want to discuss more explicitly. Examples are an actual bug when changing columns: #14237 and how to define number of input features #13603.
Related is also the idea of checking for column name consistency: #7242
Mainly I'm asking whether it makes sense to have the same requirements for ColumnTransformer as for other estimators.
ColumnTransformer is the only estimator that addresses columns by name, and so reordering columns doesn't impact the model and actually everything works fine. Should we still aim for checking for reordering of columns?
Right now, transform
even works if we add additional columns to X
, which makes sense, since it only cares about the columns it is using.
Should we allow that going forward as long as remainder is not used? (if remainder is used, the result of ColumnTransformer
would have a different shape and downstream estimators would break, so I think we shouldn't support it in this case).
In either case, what's n_features_in_
for a ColumnTransformer? The number of columns in X
during fit
or the number of columns actually used?
Does it even make sense to define it if we allow adding other columns that are ignored?