Description
In KBinsDiscretizer, KMeans is used with its default parameters (tol=1e-4, algorithm='elkan').
km = KMeans(n_clusters=n_bins[jj], init=init, n_init=1)
But the 'full' algorithm works better. Here are timings from two different machines. I also checked two different tol values, since discretization does not need high precision and relaxing tol from its default of 1e-4 could be beneficial:
Timings 1 (Ubuntu, Intel Core i5 8300H, 32GB RAM), scikit-learn 0.23.2
Timings 2 (MacOS, Intel Core i7 7820HQ, 16GB RAM), scikit-learn 0.23.2
Kaggle Kernel, scikit-learn 0.23.2
In Colab (scikit-learn 0.22.2.post1) the behavior is different.
So, I guess something changed after 0.22.2.
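For anyone who wants to reproduce the comparison, here is a minimal sketch of how such timings could be collected. It is not the exact benchmark behind the numbers above: the synthetic data, its size, the number of bins and the two tol values are assumptions, and the helper names (`uniform_init`, `time_kmeans_1d`, `lloyd_name`) are hypothetical. The init mirrors the uniform-midpoint idea behind the `init=init` argument in the snippet above. Note that scikit-learn 1.1 renamed 'full' to 'lloyd', so the sketch picks the name from the installed version.

```python
import time

import numpy as np
import sklearn
from sklearn.cluster import KMeans


def lloyd_name():
    # The classic algorithm is called 'full' before scikit-learn 1.1
    # and 'lloyd' from 1.1 onwards.
    major, minor = (int(p) for p in sklearn.__version__.split(".")[:2])
    return "lloyd" if (major, minor) >= (1, 1) else "full"


def uniform_init(column, n_bins):
    # Deterministic init: midpoints of equal-width bins over the column range.
    edges = np.linspace(column.min(), column.max(), n_bins + 1)
    return ((edges[1:] + edges[:-1]) * 0.5)[:, None]


def time_kmeans_1d(column, n_bins, algorithm, tol):
    # Fit 1D k-means once; return (elapsed seconds, sorted cluster centers).
    km = KMeans(n_clusters=n_bins, init=uniform_init(column, n_bins),
                n_init=1, algorithm=algorithm, tol=tol)
    start = time.perf_counter()
    km.fit(column[:, None])
    elapsed = time.perf_counter() - start
    return elapsed, np.sort(km.cluster_centers_[:, 0])


if __name__ == "__main__":
    rng = np.random.RandomState(0)
    column = rng.normal(size=100_000)  # assumed data, not the original
    for algorithm in (lloyd_name(), "elkan"):
        for tol in (1e-4, 1e-3):  # assumed tolerance values
            elapsed, _ = time_kmeans_1d(column, 10, algorithm, tol)
            print(f"algorithm={algorithm!r} tol={tol:g}: {elapsed:.3f}s")
```

A single fit per configuration is noisy, so repeating each measurement a few times and taking the minimum would give more stable numbers.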
The description of the 'elkan' method states:
The “elkan” variation is more efficient on data with well-defined clusters, by using the triangle inequality. However it’s more memory intensive due to the allocation of an extra array of shape (n_samples, n_clusters).
But IMO, the assumption that a 1D array would have "well-defined" clusters is a bit naive.
I also opened a feature request related to KBD here. I would love to implement all these changes (and also some minor refactoring, like replacing format-strings with f-strings).