Kmeans clustering centroids differ between 0.23.2 and 0.24.1 #19990

@cmacdonald

Describe the bug

Clustering data with shape [540, 128] into 24 clusters using KMeans(24, random_state=42) gives different centroids under scikit-learn 0.23.2 and 0.24.1.

I don't see anything in the KMeans documentation or the changelog that would explain the change.

The difference is still there in the nightly builds.

Steps/Code to Reproduce

I have provided a Colab notebook that can be configured with different pip installs.
https://colab.research.google.com/drive/1paC6VbTmRJeEEAVFXVxuOVVn6ulvCAml?usp=sharing
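For readers who cannot open the notebook, a minimal sketch of the kind of script being compared is shown below. It is not the notebook's code: the synthetic `X` is an assumed stand-in for the real [540, 128] array, so the printed values will not match the ones quoted in this report. Running it once per installed version gives two outputs to compare.

```python
import numpy as np
import sklearn
from sklearn.cluster import KMeans

# Synthetic stand-in for the real data (assumption: the actual array
# is only available in the linked notebook).
rng = np.random.RandomState(0)
X = rng.normal(size=(540, 128))

# n_init=10 is the default in both 0.23.2 and 0.24.1; it is spelled out
# so that releases with a different default run the same number of inits.
km = KMeans(n_clusters=24, n_init=10, random_state=42).fit(X)

print("sklearn", sklearn.__version__)
print("last centroid starts:", km.cluster_centers_[-1][:2])
print("inertia:", km.inertia_)
```

Comparing `inertia_` alongside the centroids helps tell apart two equally good solutions from one version converging to a worse optimum.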

Expected Results

Based on the output from 0.23.2, the last centroid should start with:
array([ 0.0943853 ...

Actual Results

Under 0.24.1 and later, the last centroid starts with:
array([ 7.83300400e-02, -4.15086746e-02

I know it's not a matter of cluster ordering; we checked that (see the Colab notebook).
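One way to make an ordering check like this explicit is sketched below. The helper `centroids_match` is a hypothetical name, not part of scikit-learn and not necessarily what the notebook does: it pairs each centroid from one run with its nearest centroid from the other and requires the pairing to be one-to-one and within a tolerance.

```python
import numpy as np

def centroids_match(a, b, atol=1e-6):
    """Return True if `a` and `b` hold the same centroids in any row order.

    Hypothetical helper for illustration; `atol` is an assumed tolerance.
    """
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    if a.shape != b.shape:
        return False
    # Pairwise Euclidean distances between every row of a and every row of b.
    dist = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=2)
    nearest = dist.argmin(axis=1)
    # Each centroid in a must map to a distinct centroid in b ...
    one_to_one = len(set(nearest.tolist())) == len(a)
    # ... and each matched pair must agree within the tolerance.
    close = bool(np.all(dist[np.arange(len(a)), nearest] <= atol))
    return one_to_one and close
```

Calling `centroids_match(centers_old, centers_new)` on the `cluster_centers_` arrays saved from each version returns `False` only when the centroid sets themselves differ, not merely their order.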

This is having an impact on a downstream task. Is the role of the random_state different between the versions?
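A possible way to probe this question is to take the seeding step out of the picture by passing explicit initial centers. The sketch below assumes synthetic data and an arbitrary choice of initial centers; it only shows the mechanism. If two versions still disagree when given the same `init` array and `n_init=1`, the difference is unlikely to come from how `random_state` drives initialization alone.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic stand-in for the real data (assumption, not the reporter's array).
rng = np.random.RandomState(0)
X = rng.normal(size=(540, 128))

# Fixed starting centers: 24 distinct rows of X, chosen once and reused.
init_centers = X[rng.choice(len(X), size=24, replace=False)]

def fit_from(init):
    # With an explicit init array and n_init=1 there is a single run
    # starting from exactly these centers.
    return KMeans(n_clusters=24, init=init, n_init=1, random_state=42).fit(X)

first = fit_from(init_centers)
second = fit_from(init_centers)
print(np.allclose(first.cluster_centers_, second.cluster_centers_))
```

Saving `init_centers` to disk (for example with `np.save`) and loading it under each version would keep the starting point identical across installs.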

Versions

Successful:

System:
    python: 3.7.10 (default, Feb 20 2021, 21:17:23)  [GCC 7.5.0]
executable: /usr/bin/python3
   machine: Linux-4.19.112+-x86_64-with-Ubuntu-18.04-bionic

Python dependencies:
          pip: 19.3.1
   setuptools: 56.0.0
      sklearn: 0.23.2
        numpy: 1.19.5
        scipy: 1.4.1
       Cython: 0.29.22
       pandas: 1.1.5
   matplotlib: 3.2.2
       joblib: 1.0.1
threadpoolctl: 2.1.0

Built with OpenMP: True

Failing:

System:
    python: 3.7.10 (default, Feb 20 2021, 21:17:23)  [GCC 7.5.0]
executable: /usr/bin/python3
   machine: Linux-4.19.112+-x86_64-with-Ubuntu-18.04-bionic

Python dependencies:
          pip: 19.3.1
   setuptools: 56.0.0
      sklearn: 0.24.1
        numpy: 1.19.5
        scipy: 1.4.1
       Cython: 0.29.22
       pandas: 1.1.5
   matplotlib: 3.2.2
       joblib: 1.0.1
threadpoolctl: 2.1.0

Built with OpenMP: True

(These outputs are from Colab; we also verified the bug in a private Anaconda environment.)
