Fix non-abort on slurm tests #925
Pull Request Overview
This PR ensures that Slurm-based test runs abort promptly on failures and surface error details. It captures full stack traces in worker threads, enforces a join timeout to avoid hangs, and propagates unhandled thread exceptions.
- Add `traceback` import and store `exc_info` in `WorkerThread`.
- Implement a 30-second join timeout and improved exception propagation in `join_first_dead_thread`.
- Move the final thread-wait loop inside the progress context for consistent cleanup.
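The pattern described in the list above can be sketched roughly as follows. This is a minimal illustration, not the code in `toolchain/mfc/sched.py`: the `WorkerThread` name is borrowed from the PR, but its body, the `completed` and `exc` attributes, and the `join_and_propagate` helper are assumptions made for the example.

```python
import sys
import threading
import traceback


class WorkerThread(threading.Thread):
    """Thread that records any exception raised by its target (illustrative only)."""

    def __init__(self, target, args=()):
        super().__init__(daemon=True)
        self._fn = target
        self._fn_args = args
        self.exc = None         # exception instance, if the target raised
        self.exc_info = None    # (type, value, traceback) captured in the worker
        self.completed = False  # True once run() has returned, success or not

    def run(self):
        try:
            self._fn(*self._fn_args)
        except Exception as exc:
            # Keep the full exception info so the parent can print the worker's trace.
            self.exc = exc
            self.exc_info = sys.exc_info()
        finally:
            self.completed = True


def join_and_propagate(thread, timeout=30.0):
    """Hypothetical helper: join with a timeout, then re-raise worker failures."""
    thread.join(timeout=timeout)
    if thread.is_alive():
        # join() returned because of the timeout, not because the thread ended.
        raise RuntimeError(f"worker did not finish within {timeout} s")
    if thread.exc is not None:
        details = "".join(traceback.format_exception(*thread.exc_info))
        raise RuntimeError(f"worker failed:\n{details}") from thread.exc
```

Because `Thread.join(timeout=...)` returns silently on timeout, the `is_alive()` check is what distinguishes a hung worker from a finished one.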
Comments suppressed due to low confidence (1)
toolchain/mfc/sched.py:57
- There’s no test simulating a hung thread or validating the join-timeout path; consider adding a unit test that forces a thread to exceed the timeout to confirm this behavior and error propagation.
threadHolder.thread.join(timeout=30.0) # 30 second timeout
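A test along the lines suggested here might look like the sketch below. It does not exercise the real scheduler: `join_with_timeout` and `check_hung_thread_times_out` are hypothetical names, and a short timeout stands in for the 30-second one so the check runs quickly.

```python
import threading


def join_with_timeout(thread, timeout):
    """Hypothetical stand-in for the join-timeout path: raise if the thread is still running."""
    thread.join(timeout=timeout)
    if thread.is_alive():
        raise TimeoutError(f"thread still alive after {timeout} s")


def check_hung_thread_times_out():
    """Start a thread that blocks on an event and confirm the timeout path fires."""
    release = threading.Event()
    hung = threading.Thread(target=release.wait, daemon=True)
    hung.start()
    try:
        join_with_timeout(hung, timeout=0.1)
    except TimeoutError:
        timed_out = True
    else:
        timed_out = False
    finally:
        # Unblock the worker so the test does not leave a stray thread behind.
        release.set()
        hung.join(timeout=5.0)
    return timed_out
```

In a pytest suite this could be wrapped as `def test_hung_thread(): assert check_hung_thread_times_out()`.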
Codecov Report
All modified and coverable lines are covered by tests ✅

Additional details and impacted files
@@           Coverage Diff           @@
##           master     #925   +/-   ##
=======================================
  Coverage   43.71%   43.71%
=======================================
  Files          68       68
  Lines       18360    18360
  Branches     2292     2292
=======================================
  Hits         8026     8026
  Misses       8945     8945
  Partials     1389     1389
User description
On some self-hosted GitHub runners (the Slurm cases), the Slurm job does not abort after the tests finish, and the failed cases are not printed. As a result, the job keeps running until it reaches the wall-time limit, and the failed tests are never reported. This PR should fix that.
PR Type
Bug fix
Description
Fix thread hanging issues in SLURM test execution
Add proper thread joining with timeout mechanism
Improve exception handling and error reporting
Track thread completion status for better debugging
Changes walkthrough 📝
sched.py
Enhanced thread management and exception handling
toolchain/mfc/sched.py
`WorkerThread` class with completion tracking and full exception info
errors