Ranked batch mode sampling not compatible with sklearn's transformation+classification pipeline

It is common to build an sklearn pipeline that includes the necessary data preprocessing (and feature encoding) steps and ends with an estimator (for example, see [Column Transformer with Mixed Types](https://scikit-learn.org/stable/auto_examples/compose/plot_column_transformer_mixed_types.html)). The pipeline built this way can then be used as a normal classifier, where the `fit(X)` method also fits the corresponding data transformers and transforms the data.

However, in `batch.py::select_instance()`, the (dis)similarity between the training data and the instance pool is computed directly on the raw inputs, without any data transformation:

```python
_, distance_scores = pairwise_distances_argmin_min(
    X_pool_masked.reshape(n_unlabeled, -1),
    X_training.reshape(n_labeled_records, -1),
    metric=metric,
)
```

This is not optimal, as any feature engineering and transformations defined in the pipeline are ignored when the distances are computed. Furthermore, it fails completely if a pandas DataFrame is used to hold the data set.
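To illustrate what I have in mind, here is a minimal sketch of computing the distance scores in the transformed feature space. It is not modAL code: the helper name `transformed_distance_scores`, the toy columns `age` and `city`, and the choice of `LogisticRegression` are all made up for the example. It assumes the estimator is an already fitted `Pipeline` whose non-final steps are all transformers, so that `pipeline[:-1]` can be used to transform the data.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import pairwise_distances_argmin_min
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler


def transformed_distance_scores(pipeline, X_pool, X_training, metric="euclidean"):
    """Hypothetical helper: distance from each pool row to its nearest
    training row, measured after the pipeline's preprocessing steps."""
    # All steps except the final estimator (assumed to be fitted transformers).
    preprocessing = pipeline[:-1]
    pool_features = preprocessing.transform(X_pool)
    training_features = preprocessing.transform(X_training)
    _, distance_scores = pairwise_distances_argmin_min(
        pool_features, training_features, metric=metric
    )
    return distance_scores


# Toy mixed-type data held in pandas DataFrames (made up for illustration).
X_training = pd.DataFrame(
    {"age": [22.0, 35.0, 58.0, 41.0], "city": ["Paris", "Berlin", "Paris", "Rome"]}
)
y_training = np.array([0, 1, 1, 0])
X_pool = pd.DataFrame(
    {"age": [22.0, 30.0, 60.0], "city": ["Paris", "Rome", "Berlin"]}
)

preprocess = ColumnTransformer(
    [
        ("num", StandardScaler(), ["age"]),
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["city"]),
    ],
    sparse_threshold=0,  # always return a dense array
)
pipeline = Pipeline([("preprocess", preprocess), ("clf", LogisticRegression())])
pipeline.fit(X_training, y_training)

distance_scores = transformed_distance_scores(pipeline, X_pool, X_training)
print(distance_scores)
```

Because the transformers return plain arrays, the DataFrame inputs never reach `pairwise_distances_argmin_min` directly. Whether the library should detect a `Pipeline` automatically or accept a user-supplied transformation is a design question I leave open.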

Ranked batch mode sampling not compatible with sklearn's transformation+classification pipeline #104
