TST Add test for numerical issues in BIRCH #19253
base: main
Conversation
linter issues :)
Thanks for the report. Could you please update this PR to include the fix you suggest, so that those tests pass?
In particular, is it required to up-cast the batch of input data itself from the start, or is it enough to only make sure that some later-allocated intermediate data structure is `np.float64`?
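A minimal sketch of the two options being asked about (the accumulator names `linear_sum_` and `squared_sum_` follow `sklearn/cluster/_birch.py`; everything else here is illustrative, not the actual fix):

```python
import numpy as np

# Option A: up-cast the whole input batch once, e.g. at the start of fit().
X = np.asarray([[1, 2], [3, 4]], dtype=np.float16)  # user-provided data
X64 = np.asarray(X, dtype=np.float64)               # full extra copy of X

# Option B: leave X untouched and only allocate the intermediate
# per-subcluster accumulators in float64, so the running sums that
# actually lose precision are kept in double precision.
linear_sum_ = np.zeros(X.shape[1], dtype=np.float64)
squared_sum_ = 0.0
for row in X:
    r = row.astype(np.float64)  # rows are up-cast one at a time
    linear_sum_ += r
    squared_sum_ += r @ r
```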
sklearn/cluster/tests/test_birch.py
Outdated
c = _CFSubcluster()
c.update(_CFSubcluster(linear_sum=np.array([0])))
c.update(_CFSubcluster(linear_sum=np.array([1])))
assert c.radius == 0.5, c.radius
I think we should not expect exact equality between two floating point numbers in tests, so instead the following would be more appropriate:
- assert c.radius == 0.5, c.radius
+ assert c.radius == pytest.approx(0.5)
Also, I think the trailing `, c.radius` is not necessary, because pytest will already report an informative error message in case of failure.
Similar changes should be applied to all the other assert statements of this PR.
There are no possible rounding differences here. It's 1 divided by 2.
0.5 is exact here and should be the exact result independent of the architecture.
But feel free to change this to your liking if you have a fix for this bug.
I do not have a suggested fix ready.
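As a quick check of the exactness claim (assuming the CF radius is the RMS distance of the members from the centroid, which matches the 0.5 the test expects): the centroid of 0 and 1 is 0.5, the RMS distance is sqrt((0.25 + 0.25) / 2) = 0.5, and 0.5 = 2^-1 is exactly representable in binary floating point, so the strict equality cannot fail from rounding alone:

```python
import numpy as np

pts = np.array([0.0, 1.0])
centroid = pts.mean()                             # exactly 0.5
radius = np.sqrt(np.mean((pts - centroid) ** 2))  # RMS distance: exactly 0.5
assert radius == 0.5                              # holds on any IEEE-754 platform
```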
This version uses an FP64 copy in the cluster features to resolve the fp16/fp32 test cases. #19251 fixes another part of this problem by catching negative values in the sqrt() when precision is exhausted, in 923729f.
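A hedged sketch of the kind of guard #19251 applies (the actual change lives in that PR; `cf_radius` below is a hypothetical stand-alone version of the textbook CF radius):

```python
import numpy as np

def cf_radius(n_samples, linear_sum, squared_sum):
    # Textbook CF radius: sqrt(E[||x||^2] - ||centroid||^2). The two terms
    # are large and nearly equal, so their difference can come out slightly
    # negative once precision is exhausted; clamping avoids sqrt() -> NaN.
    centroid = linear_sum / n_samples
    sq_diff = squared_sum / n_samples - np.dot(centroid, centroid)
    return np.sqrt(max(0.0, sq_diff))

print(cf_radius(2, np.array([1.0]), 1.0))  # the test's points 0 and 1 -> 0.5
```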
@@ -284,7 +284,8 @@ def __init__(self, *, linear_sum=None):
             self.centroid_ = self.linear_sum_ = 0
         else:
             self.n_samples_ = 1
-            self.centroid_ = self.linear_sum_ = linear_sum
+            # defensive copy and ensure FP64 precision because of numerical issues
+            self.centroid_ = self.linear_sum_ = np.array(linear_sum, dtype=np.float64)
Reading this line, I think having `self.centroid_` as a reference to the same array as `self.linear_sum_` could lead to bugs. I think this does not happen because the `update(self, subcluster)` method below fixes `self.centroid_` by reallocating it. Still, this is confusing. I think it should be:
- self.centroid_ = self.linear_sum_ = np.array(linear_sum, dtype=np.float64)
+ self.linear_sum_ = linear_sum.astype(np.float64, copy=False)
+ self.centroid_ = self.linear_sum_.copy()  # centroid of 1 sample
Or maybe this is an intentional optimization, in which case we could just add a comment.
Given the code quality of birch here, I doubt this is an intentional optimization.
I would use `copy=True`, because what if the input array X gets modified and some leaf only has a single element?
It will likely be more efficient to reuse the arrays in `update` than to allocate new ones each time.
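A minimal, hypothetical demonstration of that aliasing concern (not code from this PR): without a copy, a subcluster built from a row of X silently changes when the caller later mutates X.

```python
import numpy as np

x = np.array([1.0, 2.0])
linear_sum_ = x                       # no copy: the CF statistics alias the input
x += 10.0                             # caller mutates its array afterwards
print(linear_sum_)                    # [11. 12.] -- the stored sum changed too

safe = np.array(x, dtype=np.float64)  # defensive copy (copy=True behaviour)
x += 10.0
print(safe)                           # still [11. 12.]: unaffected by the mutation
```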
BIRCH has numerical issues (cf. A. Lang, E. Schubert: BETULA: Numerically Stable CF-Trees for BIRCH Clustering. SISAP 2020: 281-296, https://arxiv.org/abs/2006.12881).
This pull request adds a unit test that demonstrates how severe the issues can become when the tree is fed float16 or float32 input data, and hence why the tree should always use at least float64 precision internally.
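A minimal sketch of the failure mode the new test targets (illustrative numbers, not the test's own data): for points far from the origin, the textbook variance E[x^2] - E[x]^2 cancels catastrophically in float32 while staying accurate in float64.

```python
import numpy as np

rng = np.random.default_rng(0)
x = 1e4 + rng.random(1000)  # points near 10000, spread ~1, true variance ~1/12

for dtype in (np.float32, np.float64):
    xs = x.astype(dtype)
    mean = xs.sum(dtype=dtype) / len(xs)
    sq_mean = (xs * xs).sum(dtype=dtype) / len(xs)
    var = sq_mean - mean * mean  # difference of two nearly equal ~1e8 values
    # float32: the spacing between adjacent floats at 1e8 is ~8, so the
    # result is garbage and may be negative (sqrt -> NaN); float64: ~0.083.
    print(dtype.__name__, var)
```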