Feature Request: Configurable Retry and Backoff for External Auth Refresh Token Operations #18811

@bjornrobertsson

Problem Statement

External auth token refresh failures in Coder result in immediate token deletion (https://github.com/coder/coder/blob/main/coderd/externalauth/externalauth.go#L145) without retry mechanisms, causing:

  • Workspace Authentication Failure: Users lose access until they manually re-authenticate.
  • Workspace Startup Delays: When external auth is used exclusively, Git operations hang indefinitely on expired tokens.
  • Poor User Experience: Frequent manual re-authentication is required during provider maintenance.
  • Operational Burden: High support volume for "expired token" issues.

This comes close to qualifying as a bug. The token removal was introduced to remediate the behaviour when hitting GitHub rate limits, but it negatively affects all users, including those not on GitHub, whenever the OIDC provider has a long outage or maintenance window. It runs counter to the idea that a refresh token has no expiration: deleting the refresh token after a failed refresh violates the expectation that the token remains available.

This request differs from a regression report. Rather than asking for the previous behaviour to be reinstated (to the detriment of GitHub users), it asks for backoff, configuration settings, and observability.

Current Behaviour - or lack of

No retry attempts for transient failures

  • No allowance for temporary networking issues, transient access problems, or other problems at the upstream OIDC/SAML provider.
  • No UI recovery mechanism other than re-authenticating.
  • Limited or no visibility into failure causes (e.g. from a startup script).

Proposed Solution

1. Configurable Retry Parameters
Add environment variables for each external auth provider:

# Base configuration (existing)
CODER_EXTERNAL_AUTH_0_ID="aws-gitlab"
CODER_EXTERNAL_AUTH_0_CLIENT_SECRET="xxx"

# New retry configuration
CODER_EXTERNAL_AUTH_0_RETRY_ATTEMPTS=5        # Default: 3
CODER_EXTERNAL_AUTH_0_RETRY_INITIAL_DELAY=30s # Default: 30s
CODER_EXTERNAL_AUTH_0_RETRY_MAX_DELAY=300s    # Default: 5 minutes
CODER_EXTERNAL_AUTH_0_RETRY_BACKOFF_FACTOR=2  # Default: 2
CODER_EXTERNAL_AUTH_0_RETRY_JITTER=true       # Default: true
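
A minimal sketch of how these proposed variables might be parsed, assuming they are read per provider index and fall back to the defaults listed above. The `RetryConfig` type and the `loadRetryConfig` and `defaultRetryConfig` functions are hypothetical names for illustration; none of them exist in Coder today, and the `RETRY_*` variables are the ones proposed in this issue, not current options.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
	"time"
)

// RetryConfig is a hypothetical container for the proposed per-provider settings.
type RetryConfig struct {
	Attempts      int
	InitialDelay  time.Duration
	MaxDelay      time.Duration
	BackoffFactor float64
	Jitter        bool
}

// defaultRetryConfig returns the defaults listed in the proposal.
func defaultRetryConfig() RetryConfig {
	return RetryConfig{
		Attempts:      3,
		InitialDelay:  30 * time.Second,
		MaxDelay:      5 * time.Minute,
		BackoffFactor: 2,
		Jitter:        true,
	}
}

// loadRetryConfig reads CODER_EXTERNAL_AUTH_<index>_RETRY_* through lookup
// (os.LookupEnv in production) and keeps the default for any unset variable.
func loadRetryConfig(index int, lookup func(string) (string, bool)) (RetryConfig, error) {
	cfg := defaultRetryConfig()
	prefix := fmt.Sprintf("CODER_EXTERNAL_AUTH_%d_RETRY_", index)
	var err error
	if v, ok := lookup(prefix + "ATTEMPTS"); ok {
		if cfg.Attempts, err = strconv.Atoi(v); err != nil {
			return RetryConfig{}, fmt.Errorf("%sATTEMPTS: %w", prefix, err)
		}
	}
	if v, ok := lookup(prefix + "INITIAL_DELAY"); ok {
		if cfg.InitialDelay, err = time.ParseDuration(v); err != nil {
			return RetryConfig{}, fmt.Errorf("%sINITIAL_DELAY: %w", prefix, err)
		}
	}
	if v, ok := lookup(prefix + "MAX_DELAY"); ok {
		if cfg.MaxDelay, err = time.ParseDuration(v); err != nil {
			return RetryConfig{}, fmt.Errorf("%sMAX_DELAY: %w", prefix, err)
		}
	}
	if v, ok := lookup(prefix + "BACKOFF_FACTOR"); ok {
		if cfg.BackoffFactor, err = strconv.ParseFloat(v, 64); err != nil {
			return RetryConfig{}, fmt.Errorf("%sBACKOFF_FACTOR: %w", prefix, err)
		}
	}
	if v, ok := lookup(prefix + "JITTER"); ok {
		if cfg.Jitter, err = strconv.ParseBool(v); err != nil {
			return RetryConfig{}, fmt.Errorf("%sJITTER: %w", prefix, err)
		}
	}
	return cfg, nil
}

func main() {
	// Provider 0 read from the real environment; unset variables use defaults.
	cfg, err := loadRetryConfig(0, os.LookupEnv)
	if err != nil {
		fmt.Println("invalid retry configuration:", err)
		return
	}
	fmt.Printf("%+v\n", cfg)
}
```

Passing the lookup function as a parameter keeps the parser testable without touching the process environment, and the index argument covers additional providers (`_1_`, `_2_`, and so on) with the same code.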

2. Exponential Backoff with Jitter
Implement truncated exponential backoff:

  • Start with initial delay (30s)
  • Apply configurable multiplier (2x)
  • Cap at maximum delay (5 minutes)
  • Add randomization to prevent thundering herd
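
The steps above could be sketched as follows. This is an illustration under assumptions, not Coder's implementation: `RetryConfig` and `backoffDelay` are hypothetical names, and the jitter strategy (a uniform draw between half the capped delay and the full capped delay) is one common choice that the proposal does not specify.

```go
package main

import (
	"fmt"
	"math"
	"math/rand"
	"time"
)

// RetryConfig is a hypothetical container for the proposed backoff settings.
type RetryConfig struct {
	InitialDelay  time.Duration
	MaxDelay      time.Duration
	BackoffFactor float64
	Jitter        bool
}

// defaultRetryConfig returns the defaults listed in the proposal.
func defaultRetryConfig() RetryConfig {
	return RetryConfig{
		InitialDelay:  30 * time.Second,
		MaxDelay:      5 * time.Minute,
		BackoffFactor: 2,
		Jitter:        true,
	}
}

// backoffDelay returns the wait before retry number `retry` (0-based).
// rnd must return a value in [0, 1); it is only called when jitter is enabled.
func backoffDelay(cfg RetryConfig, retry int, rnd func() float64) time.Duration {
	// Exponential growth: initial * factor^retry.
	d := float64(cfg.InitialDelay) * math.Pow(cfg.BackoffFactor, float64(retry))
	// Truncate at the configured maximum (this also handles overflow to +Inf).
	if limit := float64(cfg.MaxDelay); d > limit {
		d = limit
	}
	if cfg.Jitter {
		// Assumed strategy: pick uniformly from [d/2, d) so that clients
		// which failed together do not all retry at the same instant.
		d = d/2 + rnd()*d/2
	}
	return time.Duration(d)
}

func main() {
	cfg := defaultRetryConfig()
	for retry := 0; retry < 6; retry++ {
		fmt.Printf("retry %d: wait %v\n", retry, backoffDelay(cfg, retry, rand.Float64))
	}
}
```

With the proposed defaults and jitter disabled, the delays are 30s, 60s, 120s, 240s, and then 300s for every later retry, because 480s exceeds the 5 minute cap.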

3. Token Preservation Strategy
Critical: Preserve refresh tokens during retry window

  • Maintain workspace access during provider outages.
  • Allow manual re-authentication while retries are in progress.
  • Delete the token only after all attempts are exhausted and enough time has passed that a refresh is unlikely to ever succeed. Since refresh tokens do not expire, this period should be long, counted in days.
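
The deletion rule described here can be expressed as a small predicate. The sketch below is illustrative only: `shouldDeleteToken` is a hypothetical helper, and the 7 day grace period is an assumed value, since the proposal only asks for a period "counted in days".

```go
package main

import (
	"fmt"
	"time"
)

const day = 24 * time.Hour

// defaultGrace is an assumed value for illustration; the proposal does not fix one.
const defaultGrace = 7 * day

// shouldDeleteToken reports whether a refresh token may be removed.
// Both conditions from the proposal must hold: every retry attempt has been
// used, and refreshes have been failing for at least the grace period.
func shouldDeleteToken(attempts, maxAttempts int, failingFor, grace time.Duration) bool {
	exhausted := attempts >= maxAttempts
	expired := failingFor >= grace
	return exhausted && expired
}

func main() {
	// Attempts exhausted after a two hour outage: keep the token.
	fmt.Println(shouldDeleteToken(5, 5, 2*time.Hour, defaultGrace)) // false
	// Attempts exhausted and failing for eight days: safe to clear.
	fmt.Println(shouldDeleteToken(5, 5, 8*day, defaultGrace)) // true
}
```

Requiring both conditions means a short burst of failed attempts during an outage never removes the token on its own.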

4. Enhanced Observability

Structured logging for retry operations:

[info] external-auth: attempting token refresh provider=aws-gitlab user=user@example.com attempt=1/5
[warn] external-auth: token refresh failed provider=aws-gitlab user=user@example.com attempt=1/5 error="context canceled" next_retry_in=30s
[info] external-auth: token refresh succeeded provider=aws-gitlab user=user@example.com attempt=2/5
[error] external-auth: token refresh exhausted all attempts provider=aws-gitlab user=user@example.com, clearing refresh token

5. Background Retry Process

Implement background service for:

  • Periodic failed token checks
  • Scheduled retry attempts
  • Database status updates
  • User failure notifications
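
One possible shape for such a background process is a ticker loop that retries every record that is due. The sketch below uses in-memory data and hypothetical names (`pendingRefresh`, `refreshFunc`, `retryDue`, `runRetryLoop`); a real implementation would read and update these records in the database and use the configured backoff instead of a fixed delay.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// pendingRefresh is a hypothetical record of a token whose last refresh failed.
type pendingRefresh struct {
	Provider  string
	UserID    string
	Attempts  int
	NextRetry time.Time
}

// refreshFunc stands in for the real token refresh call.
type refreshFunc func(ctx context.Context, p pendingRefresh) error

// retryDue attempts every record whose NextRetry has passed. It returns the
// records that still need a retry and the number of successful refreshes.
func retryDue(ctx context.Context, now time.Time, pending []pendingRefresh,
	refresh refreshFunc, delay time.Duration) ([]pendingRefresh, int) {
	var remaining []pendingRefresh
	refreshed := 0
	for _, p := range pending {
		if now.Before(p.NextRetry) {
			remaining = append(remaining, p) // not due yet
			continue
		}
		if err := refresh(ctx, p); err != nil {
			p.Attempts++
			p.NextRetry = now.Add(delay) // reschedule; the token is preserved
			remaining = append(remaining, p)
			continue
		}
		refreshed++
	}
	return remaining, refreshed
}

// runRetryLoop invokes tick at every interval until ctx is cancelled.
func runRetryLoop(ctx context.Context, interval time.Duration, tick func(now time.Time)) {
	t := time.NewTicker(interval)
	defer t.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case now := <-t.C:
			tick(now)
		}
	}
}

// demo runs a single pass over two due records, one of which keeps failing.
func demo() (int, int) {
	now := time.Now()
	pending := []pendingRefresh{
		{Provider: "aws-gitlab", UserID: "alice", NextRetry: now.Add(-time.Second)},
		{Provider: "aws-gitlab", UserID: "bob", NextRetry: now.Add(-time.Second)},
	}
	refresh := func(_ context.Context, p pendingRefresh) error {
		if p.UserID == "bob" {
			return errors.New("provider unavailable")
		}
		return nil
	}
	remaining, refreshed := retryDue(context.Background(), now, pending, refresh, 30*time.Second)
	return refreshed, len(remaining)
}

func main() {
	refreshed, stillPending := demo()
	fmt.Printf("refreshed=%d pending=%d\n", refreshed, stillPending)
}
```

Separating the single pass (`retryDue`) from the scheduling loop (`runRetryLoop`) keeps the retry logic testable without timers. User notifications and status updates would hook into the failure and success branches.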

Multi-Provider Support
Support multiple providers using the existing indexed naming pattern (CODER_EXTERNAL_AUTH_<n>_...):

# GitLab provider
CODER_EXTERNAL_AUTH_0_RETRY_ATTEMPTS=5
CODER_EXTERNAL_AUTH_0_RETRY_MAX_DELAY=300s

# GitHub provider
CODER_EXTERNAL_AUTH_1_RETRY_ATTEMPTS=3
CODER_EXTERNAL_AUTH_1_RETRY_MAX_DELAY=120s

Implementation Requirements

Backward Compatibility

  • Sensible defaults for all retry parameters.
  • Must not cause service interruptions when retry is not configured; in that case, behaviour should stay similar to the current one to avoid breaking changes.

Success Criteria

  • Reduced Support Burden: Eliminate frequent "expired token" requests
  • Improved Workspace Reliability: Successful Git operations during upstream provider outages
  • Better User Experience: Continuous access during brief maintenance windows
  • Operational Visibility: Clear retry attempt logging
  • Configurable Behavior: Provider-specific retry strategies

Related Issues
  • Issue #17069: GitHub external auth intermittently fails to refresh token
  • Issue #12787: No way to fetch external auth refresh token
  • Issue #14982: GitHub Rate limit
