
Ranked batch mode sampling not compatible with sklearn's transformation+classification pipeline #104

@BoyanH


It is common to build an sklearn pipeline that includes the necessary data preprocessing (and feature encoding) steps and ends with an estimator (for example, see Column Transformer with Mixed Types). A pipeline built this way can then be used as a normal classifier, where the fit(X) method also fits the corresponding data transformers and transforms the data.

However, in batch.py::select_instance(), the (dis)similarity between the training data and the instance pool is computed directly, without any data transformation:

_, distance_scores = pairwise_distances_argmin_min(X_pool_masked.reshape(n_unlabeled, -1),
                                                   X_training.reshape(n_labeled_records, -1),
                                                   metric=metric)

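This is not optimal, as any feature engineering and transformations are ignored, so distances are computed on the raw features rather than on the representation the estimator actually sees. Furthermore, it fails completely if a pandas DataFrame is used to hold the data set.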
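
To illustrate the point, here is a minimal sketch of one possible workaround: apply every pipeline step except the final estimator to both the training data and the pool, then compute the distances on the transformed arrays. This is not modAL's code. The column names, the toy data, and the helper `transform_without_estimator` are hypothetical, and the sketch assumes a scikit-learn version that supports slicing a `Pipeline` (e.g. `pipeline[:-1]`).

```python
import numpy as np
import pandas as pd
from scipy import sparse
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import pairwise_distances_argmin_min
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy mixed-type data held in DataFrames (hypothetical columns).
X_training = pd.DataFrame({"age": [22.0, 35.0, 58.0, 41.0],
                           "city": ["A", "B", "A", "C"]})
y_training = np.array([0, 1, 0, 1])
X_pool = pd.DataFrame({"age": [30.0, 60.0, 25.0],
                       "city": ["B", "C", "A"]})

# Preprocessing + classifier in a single pipeline.
preprocess = ColumnTransformer([
    ("num", StandardScaler(), ["age"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["city"]),
])
pipeline = Pipeline([("preprocess", preprocess),
                     ("clf", LogisticRegression())])
pipeline.fit(X_training, y_training)


def transform_without_estimator(fitted_pipeline, X):
    """Hypothetical helper: run all steps except the last one, return a dense array."""
    Xt = fitted_pipeline[:-1].transform(X)
    return Xt.toarray() if sparse.issparse(Xt) else np.asarray(Xt)


# Distances are now computed in the transformed feature space.
X_training_t = transform_without_estimator(pipeline, X_training)
X_pool_t = transform_without_estimator(pipeline, X_pool)
_, distance_scores = pairwise_distances_argmin_min(
    X_pool_t, X_training_t, metric="euclidean")
```

With this approach `distance_scores` holds one value per pool instance, and the DataFrame inputs never reach the distance computation directly. Whether the transformation should happen inside select_instance() or be left to the caller is a design question this sketch does not settle.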
