Change KMeans algorithm for KBinsDiscretizer from 'elkan' (default) to 'full'

In KBinsDiscretizer, KMeans is used with its default parameters (tolerance `tol=1e-4`, called eps below; algorithm='elkan'):

https://github.com/scikit-learn/scikit-learn/blob/8c6a045e46abe94e43a971d4f8042728addfd6a7/sklearn/preprocessing/_discretization.py#L208
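For context, here is a minimal pure-Python sketch of what the 'kmeans' strategy does for a single feature, as I read the linked code: centers start at the midpoints of equal-width bins, 1D k-means runs on the column, and the bin edges are the midpoints between sorted centers. The function name `kmeans_bin_edges` is hypothetical, the loop is a plain Lloyd iteration (what 'full' refers to), and the stopping rule is simplified compared to scikit-learn's.

```python
from typing import List, Sequence


def kmeans_bin_edges(column: Sequence[float], n_bins: int,
                     tol: float = 1e-4, max_iter: int = 300) -> List[float]:
    """Bin edges for one feature via plain Lloyd iterations (illustrative only)."""
    col_min, col_max = min(column), max(column)
    # Deterministic init: centers start at the midpoints of equal-width bins.
    width = (col_max - col_min) / n_bins
    centers = [col_min + (i + 0.5) * width for i in range(n_bins)]

    for _ in range(max_iter):
        # Assignment step: each sample goes to its nearest center.
        sums = [0.0] * n_bins
        counts = [0] * n_bins
        for x in column:
            nearest = min(range(n_bins), key=lambda j: abs(x - centers[j]))
            sums[nearest] += x
            counts[nearest] += 1
        # Update step: move each center to the mean of its samples
        # (an empty cluster keeps its previous center).
        new_centers = [sums[j] / counts[j] if counts[j] else centers[j]
                       for j in range(n_bins)]
        # Simplified stopping rule: total squared center movement.
        shift = sum((a - b) ** 2 for a, b in zip(centers, new_centers))
        centers = new_centers
        if shift <= tol:
            break

    centers.sort()
    # Interior edges are midpoints between neighbouring centers.
    inner = [(a + b) / 2 for a, b in zip(centers, centers[1:])]
    return [col_min] + inner + [col_max]


data = [0.0, 0.1, 0.2, 5.0, 5.1, 5.2, 10.0, 10.1, 10.2]
print(kmeans_bin_edges(data, n_bins=3))  # four edges: min, two midpoints, max
```

Only the inner k-means loop is affected by the choice between 'elkan' and 'full'; the init and the edge computation stay the same.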

But the 'full' algorithm works better. Here are timings from two different machines. I also checked two different eps values, since discretization does not need high precision and reducing the eps parameter from 1e-4 (default) could be beneficial:
Timings 1 (Ubuntu + Intel Core i5 8300H + 32GB) + 0.23.2
![Timings 1 (Ubuntu + Intel Core i5 8300H + 32GB)](https://user-images.githubusercontent.com/36483986/105574249-20e1ad00-5d5b-11eb-9dc0-ffa2a1301903.png)
Timings 2 (MacOS + Intel Core i7 7820HQ + 16GB) + 0.23.2
![Timings 2 (MacOS + Intel Core i7 7820HQ + 16GB) ](https://user-images.githubusercontent.com/36483986/105574363-fa704180-5d5b-11eb-9f0e-b43e901211fc.png)
Kaggle Kernel + 0.23.2
![time_kaggle](https://user-images.githubusercontent.com/36483986/105575737-a5d1c400-5d65-11eb-83ef-9e55ce8a4db1.png)
In Colab (0.22.2.post1) the behavior is different:
![timings_colab](https://user-images.githubusercontent.com/36483986/105575483-e03a6180-5d63-11eb-8af7-354441db4f21.png)
So, I guess something changed after 0.22.2.
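For anyone who wants to reproduce the comparison, here is a rough sketch of a benchmark. It is not the script used for the plots above. The helper names, the normally distributed test data, the sample size, and the second tolerance value (1e-3) are all arbitrary assumptions. The fallback to `'lloyd'` assumes that newer scikit-learn releases no longer accept `'full'`.

```python
import time

import numpy as np
from sklearn.cluster import KMeans


def time_kmeans_1d(column, n_bins, algorithm, tol=1e-4):
    """Return seconds spent fitting KMeans on one feature column."""
    col_min, col_max = column.min(), column.max()
    # Same style of deterministic init as the 'kmeans' strategy (my reading).
    uniform_edges = np.linspace(col_min, col_max, n_bins + 1)
    init = ((uniform_edges[1:] + uniform_edges[:-1]) * 0.5)[:, None]
    km = KMeans(n_clusters=n_bins, init=init, n_init=1,
                algorithm=algorithm, tol=tol)
    start = time.perf_counter()
    km.fit(column[:, None])
    return time.perf_counter() - start


def benchmark(n_samples=20_000, n_bins=10, tols=(1e-4, 1e-3), seed=0):
    """Time 'elkan' against 'full' for each tolerance; returns a dict of seconds."""
    rng = np.random.RandomState(seed)
    column = rng.normal(size=n_samples)  # arbitrary test data
    results = {}
    for tol in tols:
        results[("elkan", tol)] = time_kmeans_1d(column, n_bins, "elkan", tol)
        try:
            results[("full", tol)] = time_kmeans_1d(column, n_bins, "full", tol)
        except ValueError:
            # Assumption: newer releases call the classic algorithm 'lloyd'.
            results[("full", tol)] = time_kmeans_1d(column, n_bins, "lloyd", tol)
    return results


if __name__ == "__main__":
    for (algorithm, tol), seconds in benchmark().items():
        print(f"{algorithm:>5}  tol={tol:g}  {seconds:.4f}s")
```

Single timings are noisy, so repeating each fit several times and taking the median would give a fairer picture.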

The description of the 'elkan' method states:
> The “elkan” variation is more efficient on data with well-defined clusters, by using the triangle inequality. However it’s more memory intensive due to the allocation of an extra array of shape (n_samples, n_clusters).

But IMO, the assumption that a 1D array would have "well-defined" clusters is a bit naive.
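To make the triangle-inequality argument from the quoted description concrete, here is a small sketch of the pruning rule for the 1D case: if the distance between the current best center and another center is at least twice the distance from the sample to the best center, the other center cannot be closer and its distance need not be computed. This shows only that single bound in one assignment pass. The real 'elkan' variant also keeps per-sample bounds across iterations, which is where the extra (n_samples, n_clusters) array comes from. The function name `assign_with_pruning` is hypothetical.

```python
from typing import List, Sequence, Tuple


def assign_with_pruning(points: Sequence[float],
                        centers: Sequence[float]) -> Tuple[List[int], int]:
    """Nearest-center labels plus the number of distance computations skipped."""
    k = len(centers)
    # Center-to-center distances, computed once per assignment pass.
    cc = [[abs(a - b) for b in centers] for a in centers]
    labels, skipped = [], 0
    for x in points:
        best, best_dist = 0, abs(x - centers[0])
        for j in range(1, k):
            # Triangle inequality: d(best, j) >= 2 * d(x, best)
            # implies d(x, j) >= d(x, best), so center j cannot win.
            if cc[best][j] >= 2 * best_dist:
                skipped += 1
                continue
            dist = abs(x - centers[j])
            if dist < best_dist:
                best, best_dist = j, dist
        labels.append(best)
    return labels, skipped


labels, skipped = assign_with_pruning([0.0, 0.1, 5.0, 5.1, 10.0],
                                      [0.1, 5.1, 10.1])
print(labels, skipped)
```

Counting `skipped` on real feature columns would show how often the bound actually fires for a given dataset.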

I also opened a feature request related to KBinsDiscretizer [here](https://github.com/scikit-learn/scikit-learn/issues/19255). I would love to implement all these changes (and also some minor refactoring, like replacing format-strings with f-strings).


