K8SPG-798 introduce retry logic on self-healing test #1181

gkech · 2025-06-24T20:17:52Z

CHANGE DESCRIPTION

Problem:

The self-healing test is very flaky, especially on this step:


    logger.go:42: 12:55:14 | self-healing/13-read-from-all-pods | +++ get_client_pod
    logger.go:42: 12:55:14 | self-healing/13-read-from-all-pods | +++ kubectl -n kuttl-test-unified-alpaca get pods --selector=name=pg-client -o 'jsonpath={.items[].metadata.name}'
    logger.go:42: 12:55:15 | self-healing/13-read-from-all-pods | ++ kubectl -n kuttl-test-unified-alpaca exec pg-client-84d6c45668-r2qjs -- bash -c 'printf '\''\c myapp \\\ SELECT * from myApp;\n'\'' | psql -v ON_ERROR_STOP=1 -t -q postgres://'\''postgres:wejl1rCytZCXgkUMiJeCAZNO@self-healing-instance1-g8k4-0.self-healing-pods.kuttl-test-unified-alpaca.svc'\'''
    logger.go:42: 12:55:16 | self-healing/13-read-from-all-pods | psql: error: connection to server at "self-healing-instance1-g8k4-0.self-healing-pods.kuttl-test-unified-alpaca.svc" (10.112.208.28), port 5432 failed: No route to host
    logger.go:42: 12:55:16 | self-healing/13-read-from-all-pods | 	Is the server running on that host and accepting TCP/IP connections?
    logger.go:42: 12:55:16 | self-healing/13-read-from-all-pods | command terminated with exit code 2
    logger.go:42: 12:55:16 | self-healing/13-read-from-all-pods | + data=
    case.go:378: failed in step 13-read-from-all-pods
    case.go:380: command "set -o xtrace\\n source ../../functions\\n pods=$(get_instance_set_po..." failed, exit status 2
    logger.go:42: 12:55:16 | self-healing | self-healing events from ns kuttl-test-unified-alpaca:
    logger.go:42: 12:55:16 | self-healing | 2025-06-24 12:50:01 +0000 UTC	Normal	Pod pg-client-84d6c45668-r2qjs	Binding	Scheduled	Successfully assigned kuttl-test-unified-alpaca/pg-client-84d6c45668-r2qjs to gke-jen-pg-1176-cb6df120-default-pool-694ae38a-1q87	default-scheduler	
    logger.go:42: 12:55:16 | self-healing | 2025-06-24 12:50:01 +0000 UTC	Normal	ReplicaSet.apps pg-client-84d6c45668		SuccessfulCreate	Created pod: pg-client-84d6c45668-r2qjs	replicaset-controller	
    logger.go:42: 12:55:16 | self-healing | 2025-06-24 12:50:01 +0000 UTC	Normal	Deployment.apps pg-clie

Cause:
Short explanation of the root cause of the issue if applicable.

Solution:
Introduce retry logic in all of the read-from-all-pods steps. If a pod is not yet ready to serve the query behind an assertion (as in the "No route to host" failure above), the step retries the read instead of failing on the first attempt.
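
As a rough illustration of the idea, a retry wrapper for a test step could look like the sketch below. This is not the code from this PR: the function name `retry`, its argument order, and the attempt and delay values are all assumptions made for the example.

```shell
#!/usr/bin/env bash

# Hypothetical helper (not taken from the PR): run a command until it
# succeeds, giving up after max_attempts tries and sleeping delay seconds
# between tries.
retry() {
    local max_attempts=$1
    local delay=$2
    shift 2

    local attempt=1
    # A command used as the condition of `until` does not trigger errexit,
    # so this loop behaves the same with or without `set -e`.
    until "$@"; do
        if [ "$attempt" -ge "$max_attempts" ]; then
            echo "command failed after ${attempt} attempts: $*" >&2
            return 1
        fi
        attempt=$((attempt + 1))
        sleep "$delay"
    done
}

# Example call shape (read_from_pod is a placeholder for the step's query):
#   data=$(retry 5 2 read_from_pod "$pod")
```

Because only stdout of the final successful attempt carries the query result, capturing the wrapper's output with command substitution still yields a single value to compare in the assertion, provided the failing attempts print their errors to stderr as psql does in the log above.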

CHECKLIST

Jira

Is the Jira ticket created and referenced properly?
Does the Jira ticket have the proper statuses for documentation (Needs Doc) and QA (Needs QA)?
Does the Jira ticket link to the proper milestone (Fix Version field)?

Tests

Is an E2E test/test case added for the new feature/change?
Are unit tests added where appropriate?

Config/Logging/Testability

Are all needed new/changed options added to default YAML files?
Are all needed new/changed options added to the Helm Chart?
Did we add proper logging messages for operator actions?
Did we ensure compatibility with the previous version or cluster upgrade process?
Does the change support oldest and newest supported PG version?
Does the change support oldest and newest supported Kubernetes version?

JNKPercona · 2025-06-25T07:49:18Z

Test name	Status
backup-enable-disable	passed
custom-extensions	passed
custom-tls	passed
demand-backup	passed
finalizers	passed
init-deploy	passed
monitoring	passed
monitoring-pmm3	passed
one-pod	passed
operator-self-healing	passed
pitr	passed
scaling	passed
scheduled-backup	passed
self-healing	passed
sidecars	passed
start-from-backup	passed
tablespaces	passed
telemetry-transfer	passed
upgrade-consistency	passed
upgrade-minor	passed
users	passed
We run 21 out of 21

commit: 6317d40
image: perconalab/percona-postgresql-operator:PR-1181-6317d4002

K8SPG-798 introduce retry logic on self-healing test

d3e3dcf

gkech marked this pull request as ready for review June 24, 2025 20:32

gkech requested review from jvpasinatto, eleo007 and valmiranogueira as code owners June 24, 2025 20:32

gkech requested review from hors, egegunes, pooknull and nmarukovich June 24, 2025 20:32

eleo007 previously approved these changes Jun 25, 2025

gkech added 2 commits June 25, 2025 09:24

Merge branch 'main' into K8SPG-798

b222b72

remove errexit

6317d40

gkech dismissed eleo007’s stale review via 6317d40 June 25, 2025 06:25

gkech requested a review from eleo007 June 25, 2025 07:44

hors approved these changes Jun 25, 2025

hors merged commit c4df12a into main Jun 25, 2025
18 of 19 checks passed

hors deleted the K8SPG-798 branch June 25, 2025 09:04
