TST Add test for numerical issues in BIRCH #19253

Status: Open · wants to merge 2 commits into base: main
Conversation

@kno10 (Contributor) commented Jan 22, 2021

BIRCH has numerical issues (cf. A. Lang, E. Schubert: BETULA: Numerically Stable CF-Trees for BIRCH Clustering. SISAP 2020: 281-296, https://arxiv.org/abs/2006.12881).

This pull request adds a unit test demonstrating how severe these issues become when the tree is fed float16 or float32 input data; the tree should therefore always use at least float64 precision internally.
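The failure mode described in the paper can be sketched outside scikit-learn. This is a hypothetical illustration, not the library's actual code: BIRCH-style cluster features maintain a linear sum (LS) and a sum of squares (SS) and derive the variance as SS/n - (LS/n)^2, a formula that suffers catastrophic cancellation when the data is stored in low precision and the mean dominates the spread.

```python
import numpy as np

rng = np.random.default_rng(0)
x64 = rng.normal(loc=1e4, scale=1e-1, size=1000)  # off-center data
x32 = x64.astype(np.float32)

def variance_from_sums(x):
    # Variance from the two running sums a CF-tree keeps per cluster.
    n = x.shape[0]
    ls = x.sum()          # linear sum
    ss = (x * x).sum()    # sum of squares
    return ss / n - (ls / n) ** 2

print(variance_from_sums(x64))  # close to the true variance of 0.01
print(variance_from_sums(x32))  # swamped by float32 rounding error
```

In float32 the sum of squares is on the order of 1e11, so its rounding error alone dwarfs the true variance of 0.01, which is exactly why upcasting to float64 matters.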

@adrinjalali (Member):
linter issues :)

@ogrisel (Member) left a comment:

Thanks for the report. Could you please update this PR to include the fix you suggest to make those tests pass?

In particular, is it required to up-cast the batch of input data itself from the start, or is it enough to make sure that some later-allocated intermediate data structure is np.float64?

c = _CFSubcluster()
c.update(_CFSubcluster(linear_sum=np.array([0])))
c.update(_CFSubcluster(linear_sum=np.array([1])))
assert c.radius == 0.5, c.radius
Member:

I think we should not expect exact equality between two floating-point numbers in tests, so the following would be more appropriate.

Suggested change
assert c.radius == 0.5, c.radius
assert c.radius == pytest.approx(0.5)

Also, I think the trailing , c.radius is not necessary because pytest will already report an informative error message in case of failure.

Member:

Similar changes should be applied to all the other assert statements of this PR.

@kno10 (author):

There are no possible rounding differences here: it's 1 divided by 2.
0.5 is exact here, and should be the exact result independent of the architecture.
But feel free to change this to your liking if you have a fix for the bug.

@kno10 (author) commented Jan 26, 2021

I do not have a suggested fix ready.
I don't know what the preferred way of upcasting would be.

@kno10 (author) commented Jan 29, 2021

This version uses an FP64 copy in the cluster features to resolve the fp16/fp32 test cases.
I have disabled the test that demonstrates the issue with FP64 precision, but left it in for reference. With FP64, about 8 decimal digits of precision are fine; on data that is far off-center, the issue can still occur if std(X) < 1e-8 * mean(X) in part of the data.

#19251 fixes another part of this problem by catching negative values in the sqrt() when precision is exhausted, in 923729f
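The std(X) < 1e-8 * mean(X) regime can be reproduced directly; this is a hypothetical float64 sketch with values chosen for illustration, not taken from the test suite. The SS/n - (LS/n)^2 subtraction cancels completely and may even go negative, which is why clamping before sqrt() helps:

```python
import numpy as np

# Data with mean 1e8 and std 0.1, i.e. std(X) ~ 1e-9 * mean(X): even float64
# cannot recover the true variance of 0.01 after the huge cancellation below.
x = np.full(4, 1e8) + np.array([-0.1, 0.1, -0.1, 0.1])
n, ls, ss = len(x), x.sum(), (x * x).sum()
diff = ss / n - (ls / n) ** 2          # mathematically 0.01, numerically lost
radius = np.sqrt(np.maximum(diff, 0))  # clamp negatives before sqrt()
```

Near 1e16 the float64 grid spacing is about 2, so the 0.01 contribution to each squared term is rounded away entirely before the subtraction even happens.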

@@ -284,7 +284,8 @@ def __init__(self, *, linear_sum=None):
             self.centroid_ = self.linear_sum_ = 0
         else:
             self.n_samples_ = 1
-            self.centroid_ = self.linear_sum_ = linear_sum
+            # defensive copy and ensure FP64 precision because of numerical issues
+            self.centroid_ = self.linear_sum_ = np.array(linear_sum, dtype=np.float64)
Member:

Reading this line, I think having self.centroid_ as a reference to the same array as self.linear_sum_ could lead to bugs. I think this does not happen in practice because the update(self, subcluster) method below reassigns self.centroid_ to a freshly allocated array. Still, this is confusing. I think it should be:

Suggested change
self.centroid_ = self.linear_sum_ = np.array(linear_sum, dtype=np.float64)
self.linear_sum_ = linear_sum.astype(np.float64, copy=False)
self.centroid_ = self.linear_sum_.copy() # centroid of 1 sample

Member:

Or maybe this is an intentional optimization... In which case we could just add a comment.

@kno10 (author):

Given the code quality of birch here, I doubt this is an intentional optimization.
I would use copy=True: what if the input array X gets modified and some leaf only has a single element?
It will likely be more efficient to reuse the arrays in update than to allocate a new one each time.
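The aliasing concern with astype(..., copy=False) can be shown in a few lines; this is a hypothetical illustration, not the actual scikit-learn code. When the input is already float64, no copy is made, so a later in-place edit of X silently changes the stored cluster feature:

```python
import numpy as np

X = np.array([1.0, 2.0, 3.0])                  # already float64
linear_sum = X.astype(np.float64, copy=False)  # no copy: same buffer as X
assert np.shares_memory(X, linear_sum)

X[0] = 99.0
print(linear_sum[0])   # 99.0 -- the "cluster feature" changed under us

safe = X.astype(np.float64, copy=True)         # defensive copy
X[1] = -1.0
print(safe[1])         # still 2.0
```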

@glemaitre glemaitre changed the title Add test for numerical issues in BIRCH TST Add test for numerical issues in BIRCH Aug 4, 2021