Fix non-abort on slurm tests #925

sbryngelson · 2025-07-08T13:13:19Z

User description

Some GH self-hosted runners (slurm cases) will not abort to the slurm job after running the tests (they also don't print the failed cases). As a result, the job keeps going until the wall time max is reached and failed tests are never printed. This should fix that.

PR Type

Bug fix

Description

Fix thread hanging issues in SLURM test execution
Add proper thread joining with timeout mechanism
Improve exception handling and error reporting
Track thread completion status for better debugging

Changes diagram

flowchart LR
  A["WorkerThread"] --> B["Enhanced Exception Tracking"]
  B --> C["Thread Join with Timeout"]
  C --> D["Completion Status Check"]
  D --> E["Prevent SLURM Hangs"]

Changes walkthrough 📝

Relevant files

Bug fix

sched.py `Enhanced thread management and exception handling` toolchain/mfc/sched.py Enhanced `WorkerThread` class with completion tracking and full exception info Added timeout-based thread joining to prevent infinite hangs Improved exception handling to distinguish test failures from system errors Added proper error reporting with traceback information	+40/-9

Need help?
Type /help how to ... in the comments thread for any questions about Qodo Merge usage.
Check out the documentation for more information.

qodo-merge-pro · 2025-07-08T13:13:44Z

PR Reviewer Guide 🔍

Here are some key observations to aid the review process:

⏱️ Estimated effort to review: 3 🔵🔵🔵⚪⚪
🧪 No relevant tests
🔒 No security concerns identified
⚡ Recommended focus areas for review Logic Error The exception handling logic at lines 74-84 has a potential issue where `completed_successfully` is only set to True after the target function completes without exception, but the code checks this flag when an exception occurred. This creates a logical contradiction that may prevent proper error propagation. if threadHolder.thread.completed_successfully: # Test framework handled the exception gracefully (e.g., test failure) # Don't re-raise - this is expected behavior pass else: # Unhandled exception - this indicates a real problem if hasattr(threadHolder.thread, 'exc_info') and threadHolder.thread.exc_info: error_msg = f"Worker thread {threadID} failed with unhandled exception:\n{threadHolder.thread.exc_info}" raise Exception(error_msg) from threadHolder.thread.exc else: raise threadHolder.thread.exc Timeout Handling The 30-second timeout for thread joining may be too short for legitimate long-running tests, potentially causing false positives. The timeout value should be configurable or based on test characteristics rather than hardcoded. threadHolder.thread.join(timeout=30.0) # 30 second timeout # Double-check that thread actually finished joining if threadHolder.thread.is_alive(): # Thread didn't finish within timeout - this is a serious issue raise Exception(f"Thread {threadID} failed to join within 30 seconds timeout. " f"Thread may be hung or in an inconsistent state.") except Exception as join_exc: # Handle join-specific exceptions with more context raise Exception(f"Failed to join thread {threadID}: {join_exc}. " f"This may indicate a system threading issue or hung test case.") Exception Chaining The exception handling creates nested exceptions with potentially confusing error messages. The `raise Exception(error_msg) from threadHolder.thread.exc` pattern may obscure the original exception details that are important for debugging test failures. if hasattr(threadHolder.thread, 'exc_info') and threadHolder.thread.exc_info: error_msg = f"Worker thread {threadID} failed with unhandled exception:\n{threadHolder.thread.exc_info}" raise Exception(error_msg) from threadHolder.thread.exc else: raise threadHolder.thread.exc

qodo-merge-pro · 2025-07-08T13:14:26Z

PR Code Suggestions ✨

Explore these optional code suggestions:

Category	Suggestion	Impact
General	Fix exception handling scope issue The exception handling catches all exceptions including the timeout exception raised in the if block, creating a confusing error message chain. The timeout check should be outside the try-except block to avoid catching its own exception. toolchain/mfc/sched.py [56-68] try: threadHolder.thread.join(timeout=30.0) # 30 second timeout - - # Double-check that thread actually finished joining - if threadHolder.thread.is_alive(): - # Thread didn't finish within timeout - this is a serious issue - raise Exception(f"Thread {threadID} failed to join within 30 seconds timeout. " - f"Thread may be hung or in an inconsistent state.") - except Exception as join_exc: # Handle join-specific exceptions with more context raise Exception(f"Failed to join thread {threadID}: {join_exc}. " f"This may indicate a system threading issue or hung test case.") +# Double-check that thread actually finished joining +if threadHolder.thread.is_alive(): + # Thread didn't finish within timeout - this is a serious issue + raise Exception(f"Thread {threadID} failed to join within 30 seconds timeout. " + f"Thread may be hung or in an inconsistent state.") + Suggestion importance[1-10]: 8 __ Why: The suggestion correctly identifies that the `try...except` block improperly catches the timeout exception it raises, which would lead to a confusing and incorrect error message.	Medium
General	Use else clause for completion flag The `completed_successfully` flag is set to True only when no exception occurs, but it's placed inside the try block after the target execution. This creates a race condition where the flag might not be set if an exception occurs during the flag assignment itself. toolchain/mfc/sched.py [16-23] try: if self._target: self._target(self._args, *self._kwargs) - self.completed_successfully = True # Mark as completed successfully except Exception as exc: self.exc = exc # Store the full traceback for better error reporting self.exc_info = traceback.format_exc() +else: + # Only set completion flag if no exception occurred + self.completed_successfully = True Suggestion importance[1-10]: 6 __ Why: The suggestion correctly proposes using a `try...except...else` block, which is a more robust and idiomatic way to set the `completed_successfully` flag only upon successful execution.	Low
Possible issue	Fix contradictory exception handling logic The logic for handling exceptions when `completed_successfully` is True is flawed. If an exception occurred but the thread is marked as completed successfully, this indicates a bug in the completion tracking logic. Both conditions cannot be true simultaneously. toolchain/mfc/sched.py [73-84] if threadHolder.thread.exc is not None: - if threadHolder.thread.completed_successfully: - # Test framework handled the exception gracefully (e.g., test failure) - # Don't re-raise - this is expected behavior - pass + # Always propagate exceptions - if there's an exception, the thread didn't complete successfully + if hasattr(threadHolder.thread, 'exc_info') and threadHolder.thread.exc_info: + error_msg = f"Worker thread {threadID} failed with unhandled exception:\n{threadHolder.thread.exc_info}" + raise Exception(error_msg) from threadHolder.thread.exc else: - # Unhandled exception - this indicates a real problem - if hasattr(threadHolder.thread, 'exc_info') and threadHolder.thread.exc_info: - error_msg = f"Worker thread {threadID} failed with unhandled exception:\n{threadHolder.thread.exc_info}" - raise Exception(error_msg) from threadHolder.thread.exc - else: - raise threadHolder.thread.exc + raise threadHolder.thread.exc Suggestion importance[1-10]: 7 __ Why: The suggestion correctly identifies that `threadHolder.thread.exc` and `threadHolder.thread.completed_successfully` cannot be simultaneously true, making the conditional logic flawed and containing unreachable code.	Medium
Update

Copilot

Pull Request Overview

This PR ensures that Slurm-based test runs abort promptly on failures and surface error details. It captures full stack traces in worker threads, enforces a join timeout to avoid hangs, and propagates unhandled thread exceptions.

Add traceback import and store exc_info in WorkerThread.
Implement a 30-second join timeout and improved exception propagation in join_first_dead_thread.
Move the final thread-wait loop inside the progress context for consistent cleanup.

Comments suppressed due to low confidence (1)

toolchain/mfc/sched.py:57

There’s no test simulating a hung thread or validating the join-timeout path; consider adding a unit test that forces a thread to exceed the timeout to confirm this behavior and error propagation.

                    threadHolder.thread.join(timeout=30.0)  # 30 second timeout

toolchain/mfc/sched.py

codecov · 2025-07-08T15:54:34Z

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 43.71%. Comparing base (adcc0dd) to head (253ac51).
Report is 2 commits behind head on master.

Additional details and impacted files

@@           Coverage Diff           @@
##           master     #925   +/-   ##
=======================================
  Coverage   43.71%   43.71%           
=======================================
  Files          68       68           
  Lines       18360    18360           
  Branches     2292     2292           
=======================================
  Hits         8026     8026           
  Misses       8945     8945           
  Partials     1389     1389

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

fix slurm hangs when a job fails

10dd27a

Copilot AI review requested due to automatic review settings July 8, 2025 13:13

sbryngelson requested a review from a team as a code owner July 8, 2025 13:13

qodo-merge-pro bot added the Review effort 3/5 label Jul 8, 2025

Copilot AI reviewed Jul 8, 2025

View reviewed changes

toolchain/mfc/sched.py Show resolved Hide resolved

toolchain/mfc/sched.py Show resolved Hide resolved

toolchain/mfc/sched.py Outdated Show resolved Hide resolved

fix lint

7dfabd8

sbryngelson added 2 commits July 12, 2025 10:10

Merge branch 'master' into fix-slurm-hangs

466d48c

Merge branch 'master' into fix-slurm-hangs

253ac51

sbryngelson enabled auto-merge (squash) July 14, 2025 14:24

sbryngelson disabled auto-merge July 14, 2025 14:37

sbryngelson merged commit 568a837 into MFlowCode:master Jul 14, 2025
30 of 31 checks passed

sbryngelson deleted the fix-slurm-hangs branch July 15, 2025 17:59

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Fix non-abort on slurm tests #925

Fix non-abort on slurm tests #925

Uh oh!

sbryngelson commented Jul 8, 2025 •

edited by qodo-merge-pro bot

Loading

Uh oh!

qodo-merge-pro bot commented Jul 8, 2025

Uh oh!

qodo-merge-pro bot commented Jul 8, 2025 •

edited

Loading

Uh oh!

Copilot AI left a comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

codecov bot commented Jul 8, 2025 •

edited

Loading

Uh oh!

Uh oh!

Uh oh!

Fix non-abort on slurm tests #925

Fix non-abort on slurm tests #925

Uh oh!

Conversation

sbryngelson commented Jul 8, 2025 • edited by qodo-merge-pro bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

User description

PR Type

Description

Changes diagram

Changes walkthrough 📝

Uh oh!

qodo-merge-pro bot commented Jul 8, 2025

PR Reviewer Guide 🔍

Uh oh!

qodo-merge-pro bot commented Jul 8, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

PR Code Suggestions ✨

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull Request Overview

Uh oh!

Uh oh!

Uh oh!

Uh oh!

codecov bot commented Jul 8, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

Uh oh!

Uh oh!

sbryngelson commented Jul 8, 2025 •

edited by qodo-merge-pro bot

Loading

qodo-merge-pro bot commented Jul 8, 2025 •

edited

Loading

codecov bot commented Jul 8, 2025 •

edited

Loading