
Bug Report: Race Condition in Coder HA Setup During Vault-Managed PostgreSQL Password Rotation #19030

@bjornrobertsson

Description

Summary

A race condition occurs in Coder's high availability (HA) deployment when PostgreSQL password rotation is managed by HashiCorp Vault.

During password rotation, workspace apps (jupyter-notebook, code-server) become inaccessible with infinite loading, and a manual pod restart is required to recover.

No similar issue has been found.

Environment

  • Coder Version: 2.20.2
  • Deployment: Kubernetes with HA (2 replicas)
  • Database: PostgreSQL 17
  • Secret Management: HashiCorp Vault with VaultDynamicSecret
  • Network: Air-gapped environment
  • Authentication: OIDC with GitLab

Issue Description

Problem

When Vault rotates the PostgreSQL database password and triggers a rollout restart of coder pods, a race condition occurs between the two Coder instances during the replica synchronization process. This results in:

  1. Workspace apps become inaccessible: jupyter-notebook and code-server show infinite loading
  2. DERP health check instability: The health check flaps between healthy and unhealthy states
  3. Replica sync failures: Error messages indicating communication issues between replicas
  4. Authentication issues: Apps return 502 errors with "Back to site" HTML responses

Root Cause Analysis

The issue appears to be related to the replicasync process between Coder instances during password rotation. Key evidence:

  1. Failed sibling replica pings:

    coderd: failed to ping sibling replica, this could happen if the replica has shutdown
    error= do probe: Get "http://192.A.X.Y:$PORT/derp/latency-check": context deadline exceeded
    
  2. Coordinator heartbeat failures:

    coderd.pgcoord: coordinator failed heartbeat check coordinator_id=$UUID
    
  3. DERP connectivity issues:

    net.tailnet.net.wgengine: [unexpected] magicsock: derp-999 does not know about peer [2OmTQ], removing route
    

Reproduction Steps

  1. Deploy Coder in HA mode (2 replicas) with PostgreSQL
  2. Configure Vault to manage PostgreSQL password rotation with VaultDynamicSecret
  3. Set up rolloutRestartTargets to restart the Coder deployment on password change (a configuration sketch follows this list)
  4. Trigger password rotation (manually or wait for scheduled rotation)
  5. Observe that workspace apps become inaccessible despite the pods restarting successfully
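For reference, a minimal sketch of the configuration from steps 2-3, assuming the VaultDynamicSecret CRD from the Vault Secrets Operator (mount, path, role, and resource names are illustrative placeholders):

    apiVersion: secrets.hashicorp.com/v1beta1
    kind: VaultDynamicSecret
    metadata:
      name: coder-db-creds            # illustrative name
      namespace: coder-namespace
    spec:
      vaultAuthRef: vault-auth        # illustrative VaultAuth reference
      mount: database                 # Vault database secrets engine mount
      path: creds/coder               # role issuing the PostgreSQL credentials
      destination:
        create: true
        name: coder-db-secret         # Kubernetes Secret consumed by the Coder pods
      rolloutRestartTargets:
        - kind: Deployment
          name: coder                 # both HA replicas restart on every rotation

With a target like this, every credential rotation triggers a rolling restart of the two-replica Coder deployment, which is exactly the window in which the race condition appears.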

Technical Details

Race Condition Mechanism

During rolling updates with database password changes, the following condition occurs:

  1. Vault rotates PostgreSQL password
  2. Rolling restart begins (one pod at a time)
  3. The first pod restarts with the new password while the second pod still holds connections established with the old password
  4. Replica synchronization fails due to the inconsistent database connection states
  5. DERP network coordination becomes unstable
  6. Workspace connectivity breaks

Failed Workarounds

  1. Rolling Update Strategy: Adding terminationGracePeriodSeconds: 120 and a proper rolling update configuration (roughly as sketched after this list) did not resolve the issue
  2. Deployment Strategy Change: Switching to type: Recreate initially worked but introduced other instability, with continuous pod restarts
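For reference, workaround 1 amounted to a Deployment spec roughly like the following (standard Kubernetes fields; the maxSurge/maxUnavailable values are illustrative):

    spec:
      replicas: 2
      strategy:
        type: RollingUpdate
        rollingUpdate:
          maxSurge: 1                          # start a new pod before stopping an old one
          maxUnavailable: 0
      template:
        spec:
          terminationGracePeriodSeconds: 120   # give replicas time to drain before shutdown

Even with the extended grace period, old and new pods still overlap while holding different database credentials, so the replica sync failures persisted.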

Expected vs Actual Behavior

Expected: After PostgreSQL password rotation and pod restart, Apps should remain accessible with minimal downtime.

Actual: Apps become completely inaccessible with infinite loading, requiring manual intervention (pod deletion/restart) to restore functionality.

Error Messages and Logs

Coder Pod Logs

coderd: failed to ping sibling replica, this could happen if the replica has shutdown
coderd.pgcoord: coordinator failed heartbeat check
coderd: requester is not authorized to access the object

Workspace Agent Logs

net.tailnet.net.wgengine: [unexpected] magicsock: derp-999 does not know about peer [2OmTQ], removing route
net.tailnet.net.wgengine: wg: [v2] Received message with unknown type

HTTP Responses

GET /@user/workspace/apps/jupyter-notebook/api/events/subscribe
Status: 502
Response: ">Back to site</a>"

Impact

  • High: Complete loss of workspace app functionality during password rotation
  • Business Critical: Affects all users in air-gapped production environment
  • Security Impact: Prevents compliance with automated password rotation requirements

Suggested Solutions

Immediate Workaround

Delete all Coder pods at once instead of performing a rolling restart, so every replica reconnects to PostgreSQL with the new credentials at the same time:

kubectl delete pods -l app=coder -n coder-namespace

Proposed Fixes

  1. Implement graceful replica sync during password rotation

    • Add coordination mechanism between replicas during database credential changes
    • Ensure consistent database connection state across all instances
  2. Enhance DERP relay stability during restarts

    • Improve error handling in enterprise/replicasync/replicasync.go
    • Add retry mechanisms for failed peer connections
  3. Add password rotation awareness

    • Detect database credential changes and coordinate replica restart sequence
    • Implement proper cleanup of stale connection pools

Code References

Suspected components based on error messages:

  • enterprise/replicasync/replicasync.go:381 - Peer replica ping logic
  • cmd/pgcoord/main.go - PostgreSQL coordinator
  • internal/db/pgcoord/pgcoord.go - Database coordination logic

Additional Context

This issue specifically affects HA deployments with external secret management systems like Vault.

The condition appears to be timing-dependent and related to the coordination between multiple Coder instances during database authentication changes.

The issue doesn't occur with single-instance deployments or when database credentials remain static, indicating that this is an HA-specific race condition in credential rotation scenarios.

Labels

customer-reported: Bugs reported by enterprise customers. Only humans may set this.