Summary
A race condition occurs in Coder's high availability (HA) deployment when PostgreSQL password rotation is managed by HashiCorp Vault.
During password rotation, apps (jupyter-notebook, code-server) become inaccessible with infinite loading, requiring manual pod restart to resolve.
No similar issue has been found.
Environment
- Coder Version: 2.20.2
- Deployment: Kubernetes with HA (2 replicas)
- Database: PostgreSQL 17
- Secret Management: HashiCorp Vault with VaultDynamicSecret
- Network: Air-gapped environment
- Authentication: OIDC with GitLab
Issue Description
Problem
When Vault rotates the PostgreSQL database password and triggers a rollout restart of coder pods, a race condition occurs between the two Coder instances during the replica synchronization process. This results in:
- Workspace apps become inaccessible: jupyter-notebook and code-server show infinite loading
- DERP health check instability: Switches between healthy/unhealthy states
- Replica sync failures: Error messages indicating communication issues between replicas
- Authentication issues: Apps return 502 errors with "Back to site" HTML responses
Root Cause Analysis
The issue appears to be related to the replicasync process between Coder instances during password rotation. Key evidence:
- Failed sibling replica pings:
  coderd: failed to ping sibling replica, this could happen if the replica has shutdown error= do probe: Get "http://192.A.X.Y:$PORT/derp/latency-check": context deadline exceeded
- Coordinator heartbeat failures:
  coderd.pgcoord: coordinator failed heartbeat check coordinator_id=$UUID
- DERP connectivity issues:
  net.tailnet.net.wgengine: [unexpected] magicsock: derp-999 does not know about peer [2OmTQ], removing route
Reproduction Steps
- Deploy Coder in HA mode (2 replicas) with PostgreSQL
- Configure Vault to manage PostgreSQL password rotation with VaultDynamicSecret
- Set up rolloutRestartTargets to restart the Coder deployment on password change (see the example manifest after this list)
- Trigger password rotation (manually or wait for the scheduled rotation)
- Observe that apps become inaccessible despite the pods restarting successfully
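For reference, a minimal sketch of the kind of VaultDynamicSecret used here, assuming the Vault Secrets Operator; the names, mount, and path are illustrative, not the exact values from our environment:

apiVersion: secrets.hashicorp.com/v1beta1
kind: VaultDynamicSecret
metadata:
  name: coder-db-creds            # illustrative name
  namespace: coder-namespace
spec:
  vaultAuthRef: vault-auth        # illustrative VaultAuth reference
  mount: database                 # Vault database secrets engine mount
  path: creds/coder               # role that issues the rotated PostgreSQL credentials
  destination:
    name: coder-db-secret         # Kubernetes Secret consumed by the Coder deployment
    create: true
  rolloutRestartTargets:          # this restart is what exposes the race condition
    - kind: Deployment
      name: coder

Each rotation rewrites the destination Secret and then rolls the Coder deployment, which is when the replica sync failures described below appear.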
Technical Details
Race Condition Mechanism
During a rolling update that coincides with a database password change, the following sequence occurs:
- Vault rotates PostgreSQL password
- Rolling restart begins (one pod at a time)
- First pod restarts with new password, second pod still has old connection context
- Replica synchronization fails due to inconsistent database connection states
- DERP network coordination becomes unstable
- Workspace connectivity breaks
Failed Workarounds
- Rolling Update Strategy: Adding terminationGracePeriodSeconds: 120 and a proper rolling update configuration did not resolve the issue (sketch after this list)
- Deployment Strategy Change: Switching to type: Recreate initially worked but caused other instability issues with continuous pod restarts
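For completeness, the rolling update settings tried looked roughly like the sketch below; only the relevant fields are shown and the exact values are illustrative:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: coder
  namespace: coder-namespace
spec:
  replicas: 2
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 0                      # keep one replica serving during the restart
      maxSurge: 1
  template:
    spec:
      terminationGracePeriodSeconds: 120     # extra time for the old replica to deregister
      # ... container spec unchanged

Even with one replica kept available throughout the rollout, the surviving pod still holds the old database connection context, so the apps remain broken until both pods are recreated.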
Expected vs Actual Behavior
Expected: After PostgreSQL password rotation and pod restart, Apps should remain accessible with minimal downtime.
Actual: Apps become completely inaccessible with infinite loading, requiring manual intervention (pod deletion/restart) to restore functionality.
Error Messages and Logs
Coder Pod Logs
coderd: failed to ping sibling replica, this could happen if the replica has shutdown
coderd.pgcoord: coordinator failed heartbeat check
coderd: requester is not authorized to access the object
Workspace Agent Logs
net.tailnet.net.wgengine: [unexpected] magicsock: derp-999 does not know about peer [2OmTQ], removing route
net.tailnet.net.wgengine: wg: [v2] Received message with unknown type
HTTP Responses
GET /@user/workspace/apps/jupyter-notebook/api/events/subscribe
Status: 502
Response: ">Back to site</a>"
Impact
- High: Complete loss of workspace app functionality during password rotation
- Business Critical: Affects all users in air-gapped production environment
- Security Impact: Prevents automated password rotation compliance
Suggested Solutions
Immediate Workaround
Use manual pod deletion instead of rolling restart:
kubectl delete pods -l app=coder -n coder-namespace
Proposed Fixes
- Implement graceful replica sync during password rotation
  - Add a coordination mechanism between replicas during database credential changes
  - Ensure consistent database connection state across all instances
- Enhance DERP relay stability during restarts
  - Improve error handling in enterprise/replicasync/replicasync.go
  - Add retry mechanisms for failed peer connections
- Add password rotation awareness
  - Detect database credential changes and coordinate the replica restart sequence
  - Implement proper cleanup of stale connection pools
Code References
Suspected components based on error messages:
- enterprise/replicasync/replicasync.go:381 - Peer replica ping logic
- cmd/pgcoord/main.go - PostgreSQL coordinator
- internal/db/pgcoord/pgcoord.go - Database coordination logic
Additional Context
This issue specifically affects HA deployments with external secret management systems like Vault.
The condition appears to be timing-dependent and related to the coordination between multiple Coder instances during database authentication changes.
The issue does not occur with single-instance deployments or when database credentials remain static, indicating that this is an HA-specific race condition during credential rotation.