K8SPG-798 introduce retry logic on self-healing test #1181

Merged
merged 3 commits into from
Jun 25, 2025

Contributor

@gkech gkech commented Jun 24, 2025

CHANGE DESCRIPTION

Problem:

The self-healing test is very flaky, especially on this step:


er.go:42: 12:55:14 | self-healing/13-read-from-all-pods | +++ get_client_pod
    logger.go:42: 12:55:14 | self-healing/13-read-from-all-pods | +++ kubectl -n kuttl-test-unified-alpaca get pods --selector=name=pg-client -o 'jsonpath={.items[].metadata.name}'
    logger.go:42: 12:55:15 | self-healing/13-read-from-all-pods | ++ kubectl -n kuttl-test-unified-alpaca exec pg-client-84d6c45668-r2qjs -- bash -c 'printf '\''\c myapp \\\ SELECT * from myApp;\n'\'' | psql -v ON_ERROR_STOP=1 -t -q postgres://'\''postgres:wejl1rCytZCXgkUMiJeCAZNO@self-healing-instance1-g8k4-0.self-healing-pods.kuttl-test-unified-alpaca.svc'\'''
    logger.go:42: 12:55:16 | self-healing/13-read-from-all-pods | psql: error: connection to server at "self-healing-instance1-g8k4-0.self-healing-pods.kuttl-test-unified-alpaca.svc" (10.112.208.28), port 5432 failed: No route to host
    logger.go:42: 12:55:16 | self-healing/13-read-from-all-pods | 	Is the server running on that host and accepting TCP/IP connections?
    logger.go:42: 12:55:16 | self-healing/13-read-from-all-pods | command terminated with exit code 2
    logger.go:42: 12:55:16 | self-healing/13-read-from-all-pods | + data=
    case.go:378: failed in step 13-read-from-all-pods
    case.go:380: command "set -o xtrace\\n source ../../functions\\n pods=$(get_instance_set_po..." failed, exit status 2
    logger.go:42: 12:55:16 | self-healing | self-healing events from ns kuttl-test-unified-alpaca:
    logger.go:42: 12:55:16 | self-healing | 2025-06-24 12:50:01 +0000 UTC	Normal	Pod pg-client-84d6c45668-r2qjs	Binding	Scheduled	Successfully assigned kuttl-test-unified-alpaca/pg-client-84d6c45668-r2qjs to gke-jen-pg-1176-cb6df120-default-pool-694ae38a-1q87	default-scheduler	
    logger.go:42: 12:55:16 | self-healing | 2025-06-24 12:50:01 +0000 UTC	Normal	ReplicaSet.apps pg-client-84d6c45668		SuccessfulCreate	Created pod: pg-client-84d6c45668-r2qjs	replicaset-controller	
    logger.go:42: 12:55:16 | self-healing | 2025-06-24 12:50:01 +0000 UTC	Normal	Deployment.apps pg-clie

Cause:
Short explanation of the root cause of the issue if applicable.

Solution:
Introduce retry logic in every "read from all pods" step, so that when a pod is not yet ready to serve the query behind an assertion, the step retries instead of failing the test.
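
As an illustration only, the retry idea could be sketched with a small bash helper like the one below. The function name `retry`, its argument order, and the attempt and delay values are assumptions made for this sketch, not the code merged in this PR.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not taken from the PR): run a command until it
# succeeds or the attempt budget is exhausted.
# Usage: retry <max_attempts> <delay_seconds> <command> [args...]
retry() {
    local max_attempts=$1
    local delay=$2
    shift 2
    local attempt=1

    # Re-run the command until it exits with status 0.
    until "$@"; do
        if (( attempt >= max_attempts )); then
            echo "retry: '$*' failed after ${attempt} attempts" >&2
            return 1
        fi
        echo "retry: attempt ${attempt}/${max_attempts} failed, sleeping ${delay}s" >&2
        attempt=$(( attempt + 1 ))
        sleep "$delay"
    done
}
```

A read step could then wrap its query, for example `data=$(retry 5 10 read_from_pod "$pod")`, where `read_from_pod` is a hypothetical wrapper around the `kubectl exec ... psql` call shown in the log. Progress messages go to stderr so that they do not end up in the captured `data`.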

[Screenshot taken 2025-06-25 at 10:43:33 AM]

CHECKLIST

Jira

  • Is the Jira ticket created and referenced properly?
  • Does the Jira ticket have the proper statuses for documentation (Needs Doc) and QA (Needs QA)?
  • Does the Jira ticket link to the proper milestone (Fix Version field)?

Tests

  • Is an E2E test/test case added for the new feature/change?
  • Are unit tests added where appropriate?

Config/Logging/Testability

  • Are all needed new/changed options added to default YAML files?
  • Are all needed new/changed options added to the Helm Chart?
  • Did we add proper logging messages for operator actions?
  • Did we ensure compatibility with the previous version or cluster upgrade process?
  • Does the change support oldest and newest supported PG version?
  • Does the change support oldest and newest supported Kubernetes version?

@gkech gkech marked this pull request as ready for review June 24, 2025 20:32
eleo007 previously approved these changes Jun 25, 2025
@JNKPercona
Collaborator

| Test name | Status |
|---|---|
| backup-enable-disable | passed |
| custom-extensions | passed |
| custom-tls | passed |
| demand-backup | passed |
| finalizers | passed |
| init-deploy | passed |
| monitoring | passed |
| monitoring-pmm3 | passed |
| one-pod | passed |
| operator-self-healing | passed |
| pitr | passed |
| scaling | passed |
| scheduled-backup | passed |
| self-healing | passed |
| sidecars | passed |
| start-from-backup | passed |
| tablespaces | passed |
| telemetry-transfer | passed |
| upgrade-consistency | passed |
| upgrade-minor | passed |
| users | passed |
We run 21 out of 21

commit: 6317d40
image: perconalab/percona-postgresql-operator:PR-1181-6317d4002

@hors hors merged commit c4df12a into main Jun 25, 2025
18 of 19 checks passed
@hors hors deleted the K8SPG-798 branch June 25, 2025 09:04