Collection of test fixes #14206
Draft: dumbbell wants to merge 15 commits into main from fix-test-flakes-2025Q2
Conversation
[Why] Several lines were crossing the 80-column boundary, and some messages did not start with a capital letter.
[Why] When `wait_for_messages_ready/3` returns, we are sure that the replicas are in the expected state. However, even though the `#amqqueue{}` record is updated in Khepri, we don't know when all Khepri store members will be up-to-date. It can happen that `Server0` is not up-to-date yet when we query that record to get the list of replicas, leading to a test failure.

[How] First, the check is moved to its own function in `queue_utils`. Then, if Khepri is being used, we use a Khepri fence to ensure previous operations were applied on the given server. This way, we get a consistent view of the `#amqqueue{}` record and thus the list of replicas.
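For illustration, here is a minimal Erlang sketch of that pattern: issue a Khepri fence on the queried node before reading the record over RPC. The module name, the `rabbit_khepri:fence/1` call and the 30-second timeout are assumptions made for this sketch, not the PR's actual code.

```erlang
-module(khepri_fence_sketch).
-export([consistent_queue_lookup/3]).

%% Sketch only: the module, function names and timeout are illustrative.
consistent_queue_lookup(Config, Server, QResource) ->
    case rabbit_ct_broker_helpers:rpc(Config, Server,
                                      rabbit_khepri, is_enabled, []) of
        true ->
            %% Wait until all previously applied Khepri commands are
            %% visible on `Server', so the `#amqqueue{}' record read
            %% below reflects the latest membership changes.
            ok = rabbit_ct_broker_helpers:rpc(Config, Server,
                                              rabbit_khepri, fence, [30000]);
        false ->
            ok
    end,
    rabbit_ct_broker_helpers:rpc(Config, Server,
                                 rabbit_amqqueue, lookup, [QResource]).
```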
... in `remove_node_when_seed_node_is_leader/1` and `remove_node_when_seed_node_is_follower/1`.

[Why] So far, the check was performed after the partition. This was incorrect because if a cluster change was not permitted at the time of the partition, it would not be permitted afterwards either. Thus there was a race condition here.

[How] Now, the check is performed before the partition. Thanks to this new approach, we are sure of the state of node A and don't need the `case` block near the end of the test cases. This should fix some test flakes we see locally and in CI.
[Why] The default checkpoint interval is 16,384. Therefore, with the 20,000 messages published by the testcase, there is a chance a checkpoint is created. This would hit an assertion in the testcase, which expects no checkpoints before it forces the creation of one. We see this happening in CI, but not locally, because the testcase runs fast enough there.

[How] The testcase now sends 10,000 messages. This is still a lot of messages while staying under the default checkpoint interval.
…de_list_in_user` [Why] This was the only place where a condition was checked once after a connection close, instead of waiting for it to become true. This caused some transient failures in CI when the connection tracking took a bit of time to update and the check was performed before that.
[Why] The tests relied on the `rabbit_ct_client_helpers` connection and channel manager, which doesn't seem to be robust. It causes more harm than it helps so far. Hopefully, this will fix some test flakes in CI.
[Why] It looks like `erlang_vm_dist_node_queue_size_bytes` is not always present, even though other Erlang-specific metrics are present.

[How] The goal is to ensure Erlang metrics are present in the output, so just use another one that is likely to be there.
[Why] This test flaked today since the restart took 309ms, thus above the allowed 100ms (outside of CI, it takes single-digit milliseconds).

[How] Increase the allowed time, but also significantly increase `next_seq_id`. This test exists because, in the past, we had an O(n) algorithm in CQ recovery, leading to a slow recovery of even empty queues if they had a very large `next_seq_id`. Now that this operation is O(1), a much larger `next_seq_id` shouldn't affect the time it takes to run this test, while accidentally re-introducing an O(n) algorithm should make this test fail consistently.
[Why] In the `node_channel_limit` testcase, we open several channels and verify the count of opened channels in all places but one: after the first connection failure, when we try to open 3 channels. Opening 3 channels in a row might not be tracked in time to reject the third channel, because the counter is updated asynchronously.

[How] We simply wait for the counter to reach 5 before opening the third channel. We change all checks to use `?awaitMatch/3` in the process, to be more robust with timing issues.
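A minimal sketch of that retry-until-match pattern, assuming the `?awaitMatch/3` macro shipped with rabbitmq_ct_helpers; the zero-arity counter fun is a hypothetical stand-in for however the suite reads the tracked-channel count.

```erlang
-module(await_channel_count_sketch).
-export([wait_for_five_channels/1]).

-include_lib("rabbitmq_ct_helpers/include/rabbit_assert.hrl").

%% `CountFun' is supplied by the caller and returns the current
%% tracked-channel count for the node under test (hypothetical helper).
wait_for_five_channels(CountFun) ->
    %% Re-evaluate the counter until it reports 5 tracked channels or
    %% the 30-second timeout elapses, instead of asserting on a single,
    %% possibly stale, read.
    ?awaitMatch(5, CountFun(), 30000).
```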
... when testing vhost limits.

[Why] The tracking is asynchronous, thus the third MQTT connection might be opened before the tracking is up-to-date, which the testcase doesn't expect.
Prior to this commit, the following test case flaked:

```
make -C deps/rabbitmq_mqtt ct-v5 t=cluster_size_1:session_upgrade_v3_v5_qos1
```

The test case failed with:

```
{v5_SUITE,session_upgrade_v3_v5_qos,1112}
{test_case_failed,Received unexpected PUBLISH payload. Expected: <<"2">> Got: <<"1">>}
```

The broker logs showed:

```
2025-07-15 15:50:23.914152+00:00 [debug] <0.758.0> MQTT accepting TCP connection <0.758.0> (127.0.0.1:38594 -> 127.0.0.1:27005)
2025-07-15 15:50:23.914289+00:00 [debug] <0.758.0> Received a CONNECT, client ID: session_upgrade_v3_v5_qos, username: undefined, clean start: false, protocol version: 3, keepalive: 60, property names: []
2025-07-15 15:50:23.914403+00:00 [debug] <0.758.0> MQTT connection 127.0.0.1:38594 -> 127.0.0.1:27005 picked vhost using plugin_configuration_or_default_vhost
2025-07-15 15:50:23.914480+00:00 [debug] <0.758.0> User 'guest' authenticated successfully by backend rabbit_auth_backend_internal
2025-07-15 15:50:23.914641+00:00 [info] <0.758.0> Accepted MQTT connection 127.0.0.1:38594 -> 127.0.0.1:27005 for client ID session_upgrade_v3_v5_qos
2025-07-15 15:50:23.914977+00:00 [debug] <0.758.0> Received a SUBSCRIBE with subscription(s) [{mqtt_subscription,
2025-07-15 15:50:23.914977+00:00 [debug] <0.758.0>     <<"session_upgrade_v3_v5_qos">>,
2025-07-15 15:50:23.914977+00:00 [debug] <0.758.0>     {mqtt_subscription_opts,1,false,
2025-07-15 15:50:23.914977+00:00 [debug] <0.758.0>     false,0,undefined}}]
2025-07-15 15:50:23.924503+00:00 [debug] <0.764.0> MQTT accepting TCP connection <0.764.0> (127.0.0.1:38608 -> 127.0.0.1:27005)
2025-07-15 15:50:23.924922+00:00 [debug] <0.764.0> Received a CONNECT, client ID: session_upgrade_v3_v5_qos, username: undefined, clean start: false, protocol version: 5, keepalive: 60, property names: []
2025-07-15 15:50:23.925589+00:00 [error] <0.758.0> writing to MQTT socket #Port<0.63> failed: closed
2025-07-15 15:50:23.925635+00:00 [debug] <0.764.0> MQTT connection 127.0.0.1:38608 -> 127.0.0.1:27005 picked vhost using plugin_configuration_or_default_vhost
2025-07-15 15:50:23.925670+00:00 [info] <0.758.0> MQTT connection <<"127.0.0.1:38594 -> 127.0.0.1:27005">> will terminate because peer closed TCP connection
2025-07-15 15:50:23.925727+00:00 [debug] <0.764.0> User 'guest' authenticated successfully by backend rabbit_auth_backend_internal
2025-07-15 15:50:24.000790+00:00 [info] <0.764.0> Accepted MQTT connection 127.0.0.1:38608 -> 127.0.0.1:27005 for client ID session_upgrade_v3_v5_qos
2025-07-15 15:50:24.016553+00:00 [warning] <0.764.0> MQTT disconnecting client <<"127.0.0.1:38608 -> 127.0.0.1:27005">> with client ID 'session_upgrade_v3_v5_qos', reason: normal
```

This is evidence that the MQTT server connection did not process the DISCONNECT packet. The hypothesis is that the server connection did not even process the PUBACK packet from the client. Hence, the first message got requeued and re-delivered to the new v5 client.

This commit fixes the flake by not acking the first message; hence, we always expect the first message to be redelivered to the new v5 client.
[Why] Sometimes, at least in CI, it looks like the output of the CLI is prepended with a newline; sometimes it is not. This breaks the check of that output.

[How] We just trim the output before parsing it. The parsing already takes care of trimming internal whitespace.
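For illustration, a small Erlang sketch of trimming before parsing; the `string:lexemes/2` split is a stand-in for whatever parsing the test suite actually performs.

```erlang
-module(trim_cli_output_sketch).
-export([parse/1]).

%% Sketch only: real parsing in the test suite differs.
parse(Output) ->
    %% Drop the occasional leading newline (and any other surrounding
    %% whitespace) seen in CI before splitting the output into fields.
    Trimmed = string:trim(Output),
    string:lexemes(Trimmed, " \t\r\n").
```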
... in several test cases.

[Why] In CI, or in any slow and/or busy environment, it may take time for the ETS tables to get updated.
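A minimal sketch of that wait-instead-of-assert approach, assuming a `rabbit_ct_helpers:await_condition/2`-style helper; the ETS table and key are placeholders, not the tables the test cases actually check.

```erlang
-module(await_ets_sketch).
-export([wait_for_entry/2]).

%% Poll until the ETS entry shows up instead of asserting on a single,
%% possibly too-early, read. Table and Key are illustrative.
wait_for_entry(Table, Key) ->
    rabbit_ct_helpers:await_condition(
      fun() -> ets:member(Table, Key) end,
      30000).
```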
This pull request addresses test flakes that appeared in the past couple of months.