Summary
A race condition occurs in Coder's high availability (HA) deployment when PostgreSQL password rotation is managed by HashiCorp Vault.
During password rotation, apps (jupyter-notebook, code-server) become inaccessible with infinite loading, requiring manual pod restart to resolve.
No similar issue has been found.
Environment
- Coder Version: 2.20.2
- Deployment: Kubernetes with HA (2 replicas)
- Database: PostgreSQL 17
- Secret Management: HashiCorp Vault with VaultDynamicSecret
- Network: Air-gapped environment
- Authentication: OIDC with GitLab
Issue Description
Problem
When Vault rotates the PostgreSQL database password and triggers a rollout restart of coder pods, a race condition occurs between the two Coder instances during the replica synchronization process. This results in:
- Workspace apps become inaccessible: jupyter-notebook and code-server show infinite loading
- DERP health check instability: Switches between healthy/unhealthy states
- Replica sync failures: Error messages indicating communication issues between replicas
- Authentication issues: Apps return 502 errors with "Back to site" HTML responses
Root Cause Analysis
The issue appears to be related to the replicasync process between Coder instances during password rotation. Key evidence:
- Failed sibling replica pings:
  coderd: failed to ping sibling replica, this could happen if the replica has shutdown error= do probe: Get "http://192.A.X.Y:$PORT/derp/latency-check": context deadline exceeded
- Coordinator heartbeat failures:
  coderd.pgcoord: coordinator failed heartbeat check coordinator_id=$UUID
- DERP connectivity issues:
  net.tailnet.net.wgengine: [unexpected] magicsock: derp-999 does not know about peer [2OmTQ], removing route
Reproduction Steps
- Deploy Coder in HA mode (2 replicas) with PostgreSQL
- Configure Vault to manage PostgreSQL password rotation with VaultDynamicSecret
- Set up rolloutRestartTargets to restart the Coder deployment on password change (see the example manifest after this list)
- Trigger password rotation (manually or wait for the scheduled rotation)
- Observe that apps become inaccessible despite the pods restarting successfully
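For reference, a minimal sketch of the kind of VaultDynamicSecret used here, assuming the Vault Secrets Operator; the names, mount, and path are illustrative, not the exact values from our environment:

apiVersion: secrets.hashicorp.com/v1beta1
kind: VaultDynamicSecret
metadata:
  name: coder-db-creds            # illustrative name
  namespace: coder-namespace
spec:
  vaultAuthRef: vault-auth        # illustrative VaultAuth reference
  mount: database                 # Vault database secrets engine mount
  path: creds/coder               # role that issues the rotated PostgreSQL credentials
  destination:
    name: coder-db-secret         # Kubernetes Secret consumed by the Coder deployment
    create: true
  rolloutRestartTargets:          # this restart is what exposes the race condition
    - kind: Deployment
      name: coder

Each rotation rewrites the destination Secret and then rolls the Coder deployment, which is when the replica sync failures described below appear.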
Technical Details
Race Condition Mechanism
During a rolling update that coincides with a database password change, the following sequence occurs:
- Vault rotates PostgreSQL password
- Rolling restart begins (one pod at a time)
- First pod restarts with new password, second pod still has old connection context
- Replica synchronization fails due to inconsistent database connection states
- DERP network coordination becomes unstable
- Workspace connectivity breaks
Failed Workarounds
- Rolling Update Strategy: Adding terminationGracePeriodSeconds: 120 and a proper rolling update configuration did not resolve the issue (sketch after this list)
- Deployment Strategy Change: Switching to type: Recreate initially worked but caused other instability issues with continuous pod restarts
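For completeness, the rolling update settings tried looked roughly like the sketch below; only the relevant fields are shown and the exact values are illustrative:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: coder
  namespace: coder-namespace
spec:
  replicas: 2
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 0                      # keep one replica serving during the restart
      maxSurge: 1
  template:
    spec:
      terminationGracePeriodSeconds: 120     # extra time for the old replica to deregister
      # ... container spec unchanged

Even with one replica kept available throughout the rollout, the surviving pod still holds the old database connection context, so the apps remain broken until both pods are recreated.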
Expected vs Actual Behavior
Expected: After PostgreSQL password rotation and pod restart, Apps should remain accessible with minimal downtime.
Actual: Apps become completely inaccessible with infinite loading, requiring manual intervention (pod deletion/restart) to restore functionality.
Error Messages and Logs
Coder Pod Logs
coderd: failed to ping sibling replica, this could happen if the replica has shutdown
coderd.pgcoord: coordinator failed heartbeat check
coderd: requester is not authorized to access the object
Workspace Agent Logs
net.tailnet.net.wgengine: [unexpected] magicsock: derp-999 does not know about peer [2OmTQ], removing route
net.tailnet.net.wgengine: wg: [v2] Received message with unknown type
HTTP Responses
GET /@user/workspace/apps/jupyter-notebook/api/events/subscribe
Status: 502
Response: ">Back to site</a>"
Impact
- High: Complete loss of workspace app functionality during password rotation
- Business Critical: Affects all users in air-gapped production environment
- Security Impact: Prevents automated password rotation compliance
Suggested Solutions
Immediate Workaround
Use manual pod deletion instead of rolling restart:
kubectl delete pods -l app=coder -n coder-namespace
Proposed Fixes
- Implement graceful replica sync during password rotation
  - Add a coordination mechanism between replicas during database credential changes
  - Ensure consistent database connection state across all instances
- Enhance DERP relay stability during restarts
  - Improve error handling in enterprise/replicasync/replicasync.go
  - Add retry mechanisms for failed peer connections
- Add password rotation awareness
  - Detect database credential changes and coordinate the replica restart sequence
  - Implement proper cleanup of stale connection pools
Code References
Suspected components based on error messages:
- enterprise/replicasync/replicasync.go:381 - Peer replica ping logic
- cmd/pgcoord/main.go - PostgreSQL coordinator
- internal/db/pgcoord/pgcoord.go - Database coordination logic
Additional Context
This issue specifically affects HA deployments with external secret management systems like Vault.
The condition appears to be timing-dependent and related to the coordination between multiple Coder instances during database authentication changes.
The issue does not occur with single-instance deployments or when database credentials remain static, indicating that this is an HA-specific race condition during credential rotation.