Skip to content

fix(coderd): fix flake in TestAPI/ModifyAutostopWithRunningWorkspace #18932

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 1 commit into from
Jul 21, 2025

Conversation

DanielleMaywood
Copy link
Contributor

@DanielleMaywood DanielleMaywood commented Jul 21, 2025

Fixes coder/internal#521

This happened due to a race condition present in how AwaitWorkspaceBuildJobCompleted works.

AwaitWorkspaceBuildJobCompleted works by waiting until /api/v2/workspacesbuilds/{workspacebuild}/ returns a workspace build with .Job.CompletedAt != nil. The issue here is that sometimes the returned codersdk.WorkspaceBuild can contain a build from before a provisioner job completed, but contain the provisioner job from after it completed.

Let me demonstrate:

Here we query the database for database.WorkspaceBuild.

coder/coderd/coderd.go

Lines 1409 to 1415 in a3f64f7

r.Route("/workspacebuilds/{workspacebuild}", func(r chi.Router) {
r.Use(
apiKeyMiddleware,
httpmw.ExtractWorkspaceBuildParam(options.Database),
httpmw.ExtractWorkspaceParam(options.Database),
)
r.Get("/", api.workspaceBuild)

Inside of the workspaceBuild route handler, we call workspaceBuildsData

data, err := api.workspaceBuildsData(ctx, []database.WorkspaceBuild{workspaceBuild})

This then calls GetProvisionerJobsByIDsWithQueuePosition

jobs, err := api.Database.GetProvisionerJobsByIDsWithQueuePosition(ctx, database.GetProvisionerJobsByIDsWithQueuePositionParams{
IDs: jobIDs,
StaleIntervalMS: provisionerdserver.StaleInterval.Milliseconds(),
})
if err != nil && !errors.Is(err, sql.ErrNoRows) {

As these two calls happen outside of a transaction, the state of the world can change underneath. This can result in an in-progress workspace build having a completed provisioner job attached to it.

Note: The change in this PR only touches the flakey test. The underlying cause of the flake isn't being fixed. I'm happy to expand the scope of this PR to fix the cause of the flake as that might also fix other flakes.

@DanielleMaywood DanielleMaywood marked this pull request as ready for review July 21, 2025 11:18
Copy link
Contributor

@Copilot Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR fixes a flaky test TestAPI/ModifyAutostopWithRunningWorkspace caused by a race condition in AwaitWorkspaceBuildJobCompleted. The issue occurs when the API returns stale workspace build data due to non-transactional database queries between workspace builds and provisioner jobs.

  • Adds a re-fetch of the workspace build before accessing its deadline to ensure fresh data
  • Restructures the test flow to validate the deadline state before making TTL updates
Comments suppressed due to low confidence (1)

coderd/workspaces_test.go:2880

  • The variable name 'build' is being reused and shadows the original build variable from the outer scope. Consider using a more descriptive name like 'freshBuild' or 'updatedBuild' to avoid confusion.
				build, err := client.WorkspaceBuild(ctx, build.ID)

@DanielleMaywood DanielleMaywood requested a review from mtojek July 21, 2025 11:18
@DanielleMaywood DanielleMaywood merged commit f751f81 into main Jul 21, 2025
40 checks passed
@DanielleMaywood DanielleMaywood deleted the danielle/fix-workspace-update-ttl branch July 21, 2025 12:04
@github-actions github-actions bot locked and limited conversation to collaborators Jul 21, 2025
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

flake: TestWorkspaceUpdateTTL
2 participants