TST Add test for numerical issues in BIRCH #19253

Status: Open · wants to merge 2 commits into base: main
Conversation

@kno10 (Contributor) commented Jan 22, 2021

BIRCH has numerical issues (cf. A. Lang, E. Schubert: BETULA: Numerically Stable CF-Trees for BIRCH Clustering. SISAP 2020: 281-296, https://arxiv.org/abs/2006.12881).

This pull request adds a unit test demonstrating how severe these issues become when the tree is fed float16 or float32 input data; the tree should therefore always use at least float64 precision internally.
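The failure mode described in the paper can be sketched outside scikit-learn. This is a hypothetical illustration, not the library's actual code: BIRCH-style cluster features maintain a linear sum (LS) and a sum of squares (SS) and derive the variance as SS/n - (LS/n)^2, a formula that suffers catastrophic cancellation when the data is stored in low precision and the mean dominates the spread.

```python
import numpy as np

rng = np.random.default_rng(0)
x64 = rng.normal(loc=1e4, scale=1e-1, size=1000)  # off-center data
x32 = x64.astype(np.float32)

def variance_from_sums(x):
    # Variance from the two running sums a CF-tree keeps per cluster.
    n = x.shape[0]
    ls = x.sum()          # linear sum
    ss = (x * x).sum()    # sum of squares
    return ss / n - (ls / n) ** 2

print(variance_from_sums(x64))  # close to the true variance of 0.01
print(variance_from_sums(x32))  # swamped by float32 rounding error
```

In float32 the sum of squares is on the order of 1e11, so its rounding error alone dwarfs the true variance of 0.01, which is exactly why upcasting to float64 matters.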

@adrinjalali (Member):
linter issues :)

@ogrisel (Member) left a comment:

Thanks for the report. Could you please update this PR to include the fix you suggest to make those tests pass?

In particular, is it required to up-cast the batch of input data itself from the start, or is it enough to make sure that some later-allocated intermediate data structure is np.float64?

c = _CFSubcluster()
c.update(_CFSubcluster(linear_sum=np.array([0])))
c.update(_CFSubcluster(linear_sum=np.array([1])))
assert c.radius == 0.5, c.radius
Member:

I think we should not expect exact equality between two floating-point numbers in tests, so the following would be more appropriate.

Suggested change
assert c.radius == 0.5, c.radius
assert c.radius == pytest.approx(0.5)

Also, I think the trailing , c.radius is not necessary because pytest will already report an informative error message in case of failure.

Member:

Similar changes should be applied to all the other assert statements of this PR.

@kno10 (author):

There are no possible rounding differences here: it's 1 divided by 2.
0.5 is exact here, and should be the exact result independent of the architecture.
But feel free to change this to your liking if you have a fix for the bug.

@kno10 (author) commented Jan 26, 2021

I do not have a suggested fix ready.
I don't know what the preferred way of upcasting would be.

@kno10 (author) commented Jan 29, 2021

This version uses an FP64 copy in the cluster features to resolve the fp16/fp32 test cases.
I have disabled the test that demonstrates the issue with FP64 precision, but left it in for reference. With FP64, about 8 decimal digits of precision are fine; on data that is far off-center, the issue can still occur if std(X) < 1e-8 * mean(X) in part of the data.

#19251 fixes another part of this problem by catching negative values in the sqrt() when precision is exhausted, in 923729f
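The std(X) < 1e-8 * mean(X) regime can be reproduced directly; this is a hypothetical float64 sketch with values chosen for illustration, not taken from the test suite. The SS/n - (LS/n)^2 subtraction cancels completely and may even go negative, which is why clamping before sqrt() helps:

```python
import numpy as np

# Data with mean 1e8 and std 0.1, i.e. std(X) ~ 1e-9 * mean(X): even float64
# cannot recover the true variance of 0.01 after the huge cancellation below.
x = np.full(4, 1e8) + np.array([-0.1, 0.1, -0.1, 0.1])
n, ls, ss = len(x), x.sum(), (x * x).sum()
diff = ss / n - (ls / n) ** 2          # mathematically 0.01, numerically lost
radius = np.sqrt(np.maximum(diff, 0))  # clamp negatives before sqrt()
```

Near 1e16 the float64 grid spacing is about 2, so the 0.01 contribution to each squared term is rounded away entirely before the subtraction even happens.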

@@ -284,7 +284,8 @@ def __init__(self, *, linear_sum=None):
             self.centroid_ = self.linear_sum_ = 0
         else:
             self.n_samples_ = 1
-            self.centroid_ = self.linear_sum_ = linear_sum
+            # defensive copy and ensure FP64 precision because of numerical issues
+            self.centroid_ = self.linear_sum_ = np.array(linear_sum, dtype=np.float64)
Member:

Reading this line, I think having self.centroid_ as a reference to the same array as self.linear_sum_ could lead to bugs. I think this does not happen in practice because the update(self, subcluster) method below reassigns self.centroid_ to a freshly allocated array. Still, this is confusing. I think it should be:

Suggested change
self.centroid_ = self.linear_sum_ = np.array(linear_sum, dtype=np.float64)
self.linear_sum_ = linear_sum.astype(np.float64, copy=False)
self.centroid_ = self.linear_sum_.copy() # centroid of 1 sample

Member:

Or maybe this is an intentional optimization... In which case we could just add a comment.

@kno10 (author):

Given the code quality of birch here, I doubt this is an intentional optimization.
I would use copy=True: what if the input array X gets modified and some leaf only has a single element?
It will likely be more efficient to reuse the arrays in update than to allocate a new one each time.
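The aliasing concern with astype(..., copy=False) can be shown in a few lines; this is a hypothetical illustration, not the actual scikit-learn code. When the input is already float64, no copy is made, so a later in-place edit of X silently changes the stored cluster feature:

```python
import numpy as np

X = np.array([1.0, 2.0, 3.0])                  # already float64
linear_sum = X.astype(np.float64, copy=False)  # no copy: same buffer as X
assert np.shares_memory(X, linear_sum)

X[0] = 99.0
print(linear_sum[0])   # 99.0 -- the "cluster feature" changed under us

safe = X.astype(np.float64, copy=True)         # defensive copy
X[1] = -1.0
print(safe[1])         # still 2.0
```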

@glemaitre glemaitre changed the title Add test for numerical issues in BIRCH TST Add test for numerical issues in BIRCH Aug 4, 2021