Skip to content

Fix non-abort on slurm tests #925

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 4 commits into from
Jul 14, 2025

Conversation

sbryngelson
Copy link
Member

@sbryngelson sbryngelson commented Jul 8, 2025

User description

Some GH self-hosted runners (slurm cases) will not abort to the slurm job after running the tests (they also don't print the failed cases). As a result, the job keeps going until the wall time max is reached and failed tests are never printed. This should fix that.


PR Type

Bug fix


Description

  • Fix thread hanging issues in SLURM test execution

  • Add proper thread joining with timeout mechanism

  • Improve exception handling and error reporting

  • Track thread completion status for better debugging


Changes diagram

flowchart LR
  A["WorkerThread"] --> B["Enhanced Exception Tracking"]
  B --> C["Thread Join with Timeout"]
  C --> D["Completion Status Check"]
  D --> E["Prevent SLURM Hangs"]
Loading

Changes walkthrough 📝

Relevant files
Bug fix
sched.py
Enhanced thread management and exception handling               

toolchain/mfc/sched.py

  • Enhanced WorkerThread class with completion tracking and full
    exception info
  • Added timeout-based thread joining to prevent infinite hangs
  • Improved exception handling to distinguish test failures from system
    errors
  • Added proper error reporting with traceback information
  • +40/-9   

    Need help?
  • Type /help how to ... in the comments thread for any questions about Qodo Merge usage.
  • Check out the documentation for more information.
  • @Copilot Copilot AI review requested due to automatic review settings July 8, 2025 13:13
    @sbryngelson sbryngelson requested a review from a team as a code owner July 8, 2025 13:13
    Copy link

    qodo-merge-pro bot commented Jul 8, 2025

    PR Reviewer Guide 🔍

    Here are some key observations to aid the review process:

    ⏱️ Estimated effort to review: 3 🔵🔵🔵⚪⚪
    🧪 No relevant tests
    🔒 No security concerns identified
    ⚡ Recommended focus areas for review

    Logic Error

    The exception handling logic at lines 74-84 has a potential issue where completed_successfully is only set to True after the target function completes without exception, but the code checks this flag when an exception occurred. This creates a logical contradiction that may prevent proper error propagation.

    if threadHolder.thread.completed_successfully:
        # Test framework handled the exception gracefully (e.g., test failure)
        # Don't re-raise - this is expected behavior
        pass
    else:
        # Unhandled exception - this indicates a real problem
        if hasattr(threadHolder.thread, 'exc_info') and threadHolder.thread.exc_info:
            error_msg = f"Worker thread {threadID} failed with unhandled exception:\n{threadHolder.thread.exc_info}"
            raise Exception(error_msg) from threadHolder.thread.exc
        else:
            raise threadHolder.thread.exc
    Timeout Handling

    The 30-second timeout for thread joining may be too short for legitimate long-running tests, potentially causing false positives. The timeout value should be configurable or based on test characteristics rather than hardcoded.

        threadHolder.thread.join(timeout=30.0)  # 30 second timeout
    
        # Double-check that thread actually finished joining
        if threadHolder.thread.is_alive():
            # Thread didn't finish within timeout - this is a serious issue
            raise Exception(f"Thread {threadID} failed to join within 30 seconds timeout. "
                          f"Thread may be hung or in an inconsistent state.")
    
    except Exception as join_exc:
        # Handle join-specific exceptions with more context
        raise Exception(f"Failed to join thread {threadID}: {join_exc}. "
                      f"This may indicate a system threading issue or hung test case.")
    Exception Chaining

    The exception handling creates nested exceptions with potentially confusing error messages. The raise Exception(error_msg) from threadHolder.thread.exc pattern may obscure the original exception details that are important for debugging test failures.

    if hasattr(threadHolder.thread, 'exc_info') and threadHolder.thread.exc_info:
        error_msg = f"Worker thread {threadID} failed with unhandled exception:\n{threadHolder.thread.exc_info}"
        raise Exception(error_msg) from threadHolder.thread.exc
    else:
        raise threadHolder.thread.exc

    Copy link

    qodo-merge-pro bot commented Jul 8, 2025

    PR Code Suggestions ✨

    Explore these optional code suggestions:

    CategorySuggestion                                                                                                                                    Impact
    General
    Fix exception handling scope issue

    The exception handling catches all exceptions including the timeout exception
    raised in the if block, creating a confusing error message chain. The timeout
    check should be outside the try-except block to avoid catching its own
    exception.

    toolchain/mfc/sched.py [56-68]

     try:
         threadHolder.thread.join(timeout=30.0)  # 30 second timeout
    -    
    -    # Double-check that thread actually finished joining
    -    if threadHolder.thread.is_alive():
    -        # Thread didn't finish within timeout - this is a serious issue
    -        raise Exception(f"Thread {threadID} failed to join within 30 seconds timeout. "
    -                      f"Thread may be hung or in an inconsistent state.")
    -        
     except Exception as join_exc:
         # Handle join-specific exceptions with more context
         raise Exception(f"Failed to join thread {threadID}: {join_exc}. "
                       f"This may indicate a system threading issue or hung test case.")
     
    +# Double-check that thread actually finished joining
    +if threadHolder.thread.is_alive():
    +    # Thread didn't finish within timeout - this is a serious issue
    +    raise Exception(f"Thread {threadID} failed to join within 30 seconds timeout. "
    +                  f"Thread may be hung or in an inconsistent state.")
    +
    Suggestion importance[1-10]: 8

    __

    Why: The suggestion correctly identifies that the try...except block improperly catches the timeout exception it raises, which would lead to a confusing and incorrect error message.

    Medium
    Use else clause for completion flag

    The completed_successfully flag is set to True only when no exception occurs,
    but it's placed inside the try block after the target execution. This creates a
    race condition where the flag might not be set if an exception occurs during the
    flag assignment itself.

    toolchain/mfc/sched.py [16-23]

     try:
         if self._target:
             self._target(*self._args, **self._kwargs)
    -        self.completed_successfully = True  # Mark as completed successfully
     except Exception as exc:
         self.exc = exc
         # Store the full traceback for better error reporting
         self.exc_info = traceback.format_exc()
    +else:
    +    # Only set completion flag if no exception occurred
    +    self.completed_successfully = True
    Suggestion importance[1-10]: 6

    __

    Why: The suggestion correctly proposes using a try...except...else block, which is a more robust and idiomatic way to set the completed_successfully flag only upon successful execution.

    Low
    Possible issue
    Fix contradictory exception handling logic

    The logic for handling exceptions when completed_successfully is True is flawed.
    If an exception occurred but the thread is marked as completed successfully,
    this indicates a bug in the completion tracking logic. Both conditions cannot be
    true simultaneously.

    toolchain/mfc/sched.py [73-84]

     if threadHolder.thread.exc is not None:
    -    if threadHolder.thread.completed_successfully:
    -        # Test framework handled the exception gracefully (e.g., test failure)
    -        # Don't re-raise - this is expected behavior
    -        pass
    +    # Always propagate exceptions - if there's an exception, the thread didn't complete successfully
    +    if hasattr(threadHolder.thread, 'exc_info') and threadHolder.thread.exc_info:
    +        error_msg = f"Worker thread {threadID} failed with unhandled exception:\n{threadHolder.thread.exc_info}"
    +        raise Exception(error_msg) from threadHolder.thread.exc
         else:
    -        # Unhandled exception - this indicates a real problem
    -        if hasattr(threadHolder.thread, 'exc_info') and threadHolder.thread.exc_info:
    -            error_msg = f"Worker thread {threadID} failed with unhandled exception:\n{threadHolder.thread.exc_info}"
    -            raise Exception(error_msg) from threadHolder.thread.exc
    -        else:
    -            raise threadHolder.thread.exc
    +        raise threadHolder.thread.exc
    Suggestion importance[1-10]: 7

    __

    Why: The suggestion correctly identifies that threadHolder.thread.exc and threadHolder.thread.completed_successfully cannot be simultaneously true, making the conditional logic flawed and containing unreachable code.

    Medium
    • Update

    Copy link
    Contributor

    @Copilot Copilot AI left a comment

    Choose a reason for hiding this comment

    The reason will be displayed to describe this comment to others. Learn more.

    Pull Request Overview

    This PR ensures that Slurm-based test runs abort promptly on failures and surface error details. It captures full stack traces in worker threads, enforces a join timeout to avoid hangs, and propagates unhandled thread exceptions.

    • Add traceback import and store exc_info in WorkerThread.
    • Implement a 30-second join timeout and improved exception propagation in join_first_dead_thread.
    • Move the final thread-wait loop inside the progress context for consistent cleanup.
    Comments suppressed due to low confidence (1)

    toolchain/mfc/sched.py:57

    • There’s no test simulating a hung thread or validating the join-timeout path; consider adding a unit test that forces a thread to exceed the timeout to confirm this behavior and error propagation.
                        threadHolder.thread.join(timeout=30.0)  # 30 second timeout
    

    Copy link

    codecov bot commented Jul 8, 2025

    Codecov Report

    All modified and coverable lines are covered by tests ✅

    Project coverage is 43.71%. Comparing base (adcc0dd) to head (253ac51).
    Report is 2 commits behind head on master.

    Additional details and impacted files
    @@           Coverage Diff           @@
    ##           master     #925   +/-   ##
    =======================================
      Coverage   43.71%   43.71%           
    =======================================
      Files          68       68           
      Lines       18360    18360           
      Branches     2292     2292           
    =======================================
      Hits         8026     8026           
      Misses       8945     8945           
      Partials     1389     1389           

    ☔ View full report in Codecov by Sentry.
    📢 Have feedback on the report? Share it here.

    🚀 New features to boost your workflow:
    • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

    @sbryngelson sbryngelson enabled auto-merge (squash) July 14, 2025 14:24
    @sbryngelson sbryngelson disabled auto-merge July 14, 2025 14:37
    @sbryngelson sbryngelson merged commit 568a837 into MFlowCode:master Jul 14, 2025
    30 of 31 checks passed
    @sbryngelson sbryngelson deleted the fix-slurm-hangs branch July 15, 2025 17:59
    Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
    Development

    Successfully merging this pull request may close these issues.

    1 participant