Change KMeans algorithm for KBinsDiscretizer from 'elkan' (default) to 'full' #19256

@glevv

In KBinsDiscretizer, KMeans is used with its default parameters (eps=1e-4, which is the `tol` argument of KMeans, and algorithm='elkan'):

km = KMeans(n_clusters=n_bins[jj], init=init, n_init=1)

But the 'full' algorithm works better. Here are timings from two different machines. I also checked two different eps (`tol`) values, since discretization does not need high precision and relaxing eps from its default of 1e-4 could be beneficial:
Timings 1 (Ubuntu + Intel Core i5 8300H + 32GB), scikit-learn 0.23.2: [timing plot]
Timings 2 (MacOS + Intel Core i7 7820HQ + 16GB), scikit-learn 0.23.2: [timing plot]
Kaggle Kernel, scikit-learn 0.23.2: [timing plot]
In Colab (0.22.2.post1) the behavior is different: [timing plot]
So, I guess something changed after 0.22.2.
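
To make the precision argument concrete, here is a minimal pure-Python sketch of 1D Lloyd ('full') k-means binning with a tolerance-based stop. It is not scikit-learn's implementation. The names `kmeans_1d` and `bin_edges`, the synthetic data, and the simplified stopping rule (total squared center shift <= tol) are all assumptions made for illustration. The initialization (midpoints of equal-width bins, one run) is intended to mirror the `init=init, n_init=1` call quoted above.

```python
import random


def kmeans_1d(values, init_centers, tol, max_iter=300):
    """Plain Lloyd iterations on 1D data; returns (sorted centers, iterations used)."""
    centers = list(init_centers)
    for iteration in range(1, max_iter + 1):
        # Assignment step: each value goes to its nearest center.
        sums = [0.0] * len(centers)
        counts = [0] * len(centers)
        for v in values:
            nearest = min(range(len(centers)), key=lambda j: abs(v - centers[j]))
            sums[nearest] += v
            counts[nearest] += 1
        # Update step: move each center to the mean of its members
        # (an empty cluster keeps its previous center).
        new_centers = [
            sums[j] / counts[j] if counts[j] else centers[j]
            for j in range(len(centers))
        ]
        shift = sum((a - b) ** 2 for a, b in zip(centers, new_centers))
        centers = new_centers
        if shift <= tol:  # simplified stopping rule (assumption, see lead-in)
            break
    return sorted(centers), iteration


def bin_edges(values, n_bins, tol):
    """Bin edges from 1D k-means: midpoints between adjacent sorted centers."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    # Initial centers: midpoints of equal-width bins over [lo, hi].
    init = [lo + width * (i + 0.5) for i in range(n_bins)]
    centers, n_iter = kmeans_1d(values, init, tol)
    inner = [(a + b) / 2 for a, b in zip(centers, centers[1:])]
    return [lo] + inner + [hi], n_iter


rng = random.Random(0)
data = [rng.gauss(0.0, 1.0) for _ in range(1000)]  # synthetic sample, illustration only
edges_tight, iters_tight = bin_edges(data, n_bins=5, tol=1e-4)
edges_loose, iters_loose = bin_edges(data, n_bins=5, tol=1e-2)
print("iterations:", iters_tight, "(tol=1e-4) vs", iters_loose, "(tol=1e-2)")
```

Because both runs follow the same deterministic sequence of center updates, the looser tolerance can never need more iterations than the tighter one. Whether the resulting bin edges differ enough to matter depends on the data.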

The documentation for the 'elkan' method states:

The “elkan” variation is more efficient on data with well-defined clusters, by using the triangle inequality. However it’s more memory intensive due to the allocation of an extra array of shape (n_samples, n_clusters).

But IMO, the assumption that a 1D array would have "well-defined" clusters is a bit naive.

I also opened a feature request related to KBD here. I would love to implement all these changes (and also some minor refactoring, like replacing format-strings with f-strings).
